(54) Title: PARALLEL METHODS FOR GENOMIC ANALYSIS
(57) Abstract 



The present invention provides parallel methods for determining nucleotide
sequences and physical maps of polynucleotides associated with sample tags. This
information can be used to determine the chromosomal locations of sample-tagged
polynucleotides. In one embodiment, the polynucleotides are derived from genomic
DNA coupled to insertion elements. As a result, the invention also provides
parallel methods for locating the integration sites of insertion elements in the
genome.
PARALLEL METHODS FOR GENOMIC ANALYSIS
1. RELATED APPLICATION DATA 

This application claims priority to U.S. Application Serial Number 
60/105,914, filed October 28, 1998. 

2. FIELD OF THE INVENTION

The present invention is related to the field of molecular biology, and provides
parallel methods for nucleic acid sequencing, physical mapping and mapping
insertion elements.
3. BACKGROUND 

There are two methods in common use to sequence DNA: the chemical
degradation method, e.g., Maxam et al. (1977), and the chain-termination method,
e.g., Sanger et al. (1977). Efforts to improve DNA sequencing efficiency have
resulted in numerous improvements in the chain-termination method. Automation of
many steps in the process has produced significant improvements in sequencing
throughput. Nevertheless, each template still is sequenced one at a time.

Attempts have been made to introduce some parallel processing steps into the 
sequencing method. For example, Church (1990) and Church et al. (1992) teach a
strategy in which multiple templates are fragmented in a single tube by either the
chain-termination or chemical-degradation sequencing methods. The fragments are
separated on a gel and transferred to a solid membrane. Each template carries a
unique tag and the fragments are visualized by hybridization with a unique
oligonucleotide probe specific to each tag. The pattern of the fragments that hybridize
to one specific oligonucleotide probe represents the sequence information from one
template. Removal of the first oligonucleotide probe followed by hybridization of a
second oligonucleotide probe reveals the sequence pattern from a different template.
This method is limited by the requirement to maintain the pattern of fragments in
order to extract the sequence information. Therefore, only one sequence can be read
at a time; that is, this step in the method is sequential rather than parallel. There are
inherent time constraints produced by this sequential step. In addition, the number of
times any membrane can be "stripped" and reprobed is limited. For these reasons, the
application of the method is limited in practice to collections of fewer than 50
templates.



WO 00/24937 PCT/US99/25037 

Other methods are described in the art which attempt to introduce parallelism 
into different stages of the sequencing protocol. Van Ness et al. (1997) describe the
use of mass tags that can be detected by mass spectrometry. Different tags are attached
to the 5'-end of a sequencing primer. Each tagged primer is used to sequence a
different template by the chain-termination method. The different reactions are pooled
and fractionated by size (i.e., sequencing products are collected from the end of a
capillary electrophoresis device). The tags present in each fraction are assayed by
mass spectrometry. This information is deconvoluted to reproduce the "sequence
ladders" of the different templates. The method is limited by the number of different
tags that can be synthesized. More importantly, the method is not parallel until the
sequencing reactions are pooled.

A variation of the Van Ness method is described by Wong (1999). He replaces 
the chemical tags attached to the 5'-end of a primer with nucleic acid tags. Again, 
individual sequencing reactions are pooled and fractionated by size. Instead of
detection by mass spectrometry, the tags in each fraction are designed to be amplified
and labeled in vitro (i.e., PCR) followed by hybridization to an array of
oligonucleotides. Individual locations in the array will hybridize to different tags. A
positive hybridization signal indicates the tag is present in the fraction. This
information is deconvoluted to reveal the sequence ladders of the different templates.
The possible number of different tags attached to the sequencing primer is far greater
with Wong's method than with the Van Ness method. However, Wong still describes a
method that is not parallel until the sequencing reactions are pooled. Consequently,
much of the labor associated with traditional sequencing protocols still is present in
Wong's method. DNA must be prepared from individual clones, and separate
sequencing reactions must be performed on each template. In a second embodiment,
Wong attempts to introduce some parallelism into these steps. He attaches the tags to
several different sequencing primers. The different primers hybridize to different
vectors. Instead of sequencing one clone at a time, he makes separate libraries in each
vector, pools one clone from each library and sequences them with the pooled
primers. The sequencing products from different pools are then combined and
fractionated by size. Each clone still requires its own uniquely tagged primer, but
fewer sequencing reactions are needed. In theory, this same strategy can be applied to
the Van Ness mass-tag method, as described by Schmidt et al. (1999). Presumably,
the strategy will work for very small pools of primers, but as the collection of primers
and vectors increases, mispriming events and failed sequences will predominate. In
addition, single clones still are handled one at a time so considerable resources must
be dedicated simply to producing, cataloging and storing the sequencing templates.

Rabani (1996 and 1997) describes a sequencing method that employs the same 
tagged sequencing vectors used by Church (1990). A pool of templates with 
substantially different tags is sequenced with one primer as described in the Church 
patent. A label is incorporated into either the primer or the chain-terminator. The
sequencing products are fractionated by size and immediately hybridized to an array
of oligonucleotides (analogous to the array in Wong's method). Detection of the label
at a particular location in the array indicates the presence of that tag in the fraction.
The sequence ladders are deconvoluted as above. Though the method is parallel at each
step, in practice only a small number of samples can be pooled. A small amount of
labeled material is available in each fraction for hybridization to the array. This material
will determine the rate of hybridization and limits of detection. A very sensitive
oligonucleotide array can detect about 0.1 femtomoles of a complementary
polynucleotide; see Lockhart et al. (1996). Assuming each tag is present in about 1000
bands of a sequencing ladder, then at least 0.1 picomoles of any tagged clone must be
present in the pool before sequencing. A typical sequencing reaction uses about 0.5
picomoles DNA. This calculation suggests a starting pool of about five clones may be
sequenced according to Rabani's method.
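
The estimate above can be reproduced with a few lines of arithmetic. The following
sketch is illustrative only: the function and parameter names are hypothetical, and
the default figures are the approximate values quoted in the text (a detection limit
of about 0.1 femtomoles, about 1000 bands per ladder, and about 0.5 picomoles of DNA
per sequencing reaction).

```python
def max_pool_size(detection_limit_fmol=0.1, bands_per_ladder=1000,
                  reaction_input_pmol=0.5):
    """Estimate how many tagged clones can share one sequencing reaction.

    Assumes, as in the text, that the material from one clone is spread
    over the bands of its ladder and that every band must reach the
    detection limit of the array.
    """
    # Amount of each clone needed so that every band is detectable (fmol -> pmol).
    per_clone_pmol = detection_limit_fmol * bands_per_ladder / 1000.0
    # Number of clones whose combined amount fits in one reaction.
    return reaction_input_pmol / per_clone_pmol


print(round(max_pool_size()))  # about five clones under these assumptions
```

Under the same assumptions, the pool size scales linearly with the amount of DNA in
the reaction and inversely with the detection limit of the array.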

Thus, there is a need in the art for a highly parallel sequencing method that is 
not limited by any of the sequential "bottlenecks" described above. Such a sequencing
method would result in significant improvements in sequencing throughput and
substantial reductions in the cost of sequencing.

To sequence very large genomes, the DNA first must be broken down into 
smaller, more manageable clones. The determination of the overlap relationships of 
these smaller clones is needed to simplify the reconstruction of the entire sequence.
The method most frequently used is "Sequence Tagged Site" (STS) content mapping.
This method involves finding many small regions of single copy DNA (i.e., STSs)
and determining which clones contain the same STSs. Two clones that contain the
same STS must overlap. Detection of the STS is achieved by amplifying pools of
clones with the polymerase chain reaction. This mapping process is very expensive
and time consuming.
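
The overlap rule used in STS content mapping (two clones that contain the same STS
must overlap) can be sketched as follows. This is a minimal illustration, not a
mapping pipeline; the clone and STS names are hypothetical, and the STS content of
each clone is assumed to be already known.

```python
from itertools import combinations


def overlapping_clones(sts_content):
    """Return clone pairs that share at least one STS, with the shared STSs.

    sts_content maps a clone name to the set of STSs detected in that clone.
    """
    overlaps = {}
    for a, b in combinations(sorted(sts_content), 2):
        shared = sts_content[a] & sts_content[b]
        if shared:  # a shared single-copy STS implies the clones overlap
            overlaps[(a, b)] = shared
    return overlaps


# Hypothetical STS content for three clones.
sts_content = {
    "cloneA": {"sts1", "sts2"},
    "cloneB": {"sts2", "sts3"},
    "cloneC": {"sts4"},
}
pairs = overlapping_clones(sts_content)
print(pairs)
```

The pairwise comparison is simple to state but scales with the square of the number
of clones, which hints at why the process becomes expensive for large genomes.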

Ultimately, the physical mapping and sequencing of organisms is designed to 
hasten the discovery of gene function. A general step in this process is to observe the 
phenotype of the null mutant. Through "reverse genetics" it is possible to "knockout" 
the function of a cloned gene to produce the null phenotype. Usually, gene knockouts 
are produced one at a time at great expense by introducing foreign DNA into the gene. 
Even efforts to apply reverse genetics to many cloned genes simply scale up the serial 
one-by-one approach. 

For these reasons, there is a need in the art for a method that introduces 
massive parallelism into the processes of sequencing, physical mapping and the 
production of gene knockouts. The present invention provides these and other 
advantages, as described in greater detail below. 
4. BRIEF DESCRIPTION OF THE FIGURES 

FIG. 1a is a drawing of a preferred embodiment of a sample tag joined to a
sample polynucleotide. 

FIG. 1b is a drawing of a preferred embodiment of sequencing primers and
amplification primers for preparing and analyzing sequencing reaction products that 
are pooled prior to fractionation. 

FIG. 2 is a drawing of a preferred embodiment of an insertion element 
comprising a sample tag and a preferred embodiment of a method to rescue junctions. 

FIG. 3 is a preferred embodiment of a vector for sequencing or constructing 
physical maps from both ends of a sample polynucleotide. 

FIG. 4 is a flow chart of a preferred method for sequencing. 

FIG. 5 is a flow chart of a preferred method for constructing physical maps. 

FIG. 6 is a flow chart of a preferred method for producing cells containing 
located insertion elements. 
FIG. 7a is a photograph of an autoradiogram of multiplexed sequencing
reactions separated on a denaturing polyacrylamide gel.

FIG. 7b is a photograph of an autoradiogram that served as a template for
fractionating multiplexed sequencing reactions.

FIG. 8 is the readout from a multiplex sequencing experiment.

5. SUMMARY 

It is an object of the invention to provide massively parallel methods for 
generating nucleic acid sequence information from a collection of polynucleotides. 
More specifically, the method employs Sanger or Maxam and Gilbert nucleic acid
sequencing reactions carried out on a collection of sample polynucleotides cloned into
sample-tagged vectors so that a sample tag preferably is joined to one sample
polynucleotide. The sample tags are used to deconvolute the sequence information
derived from the different sample polynucleotides. Deconvolution is achieved
through hybridization of size-separated products from the sequencing reaction to an
array of tag complements.

It is another object of the invention to provide a kit for carrying out the 
disclosed massively parallel sequencing methods. The kit preferably contains a library 
of sample-tagged cloning vectors in which the target nucleic acid whose sequence is 
sought may be cloned, enzymes for cloning the target into the cloning vectors,
reagents for carrying out the sequencing reactions, reagents for amplifying the sample
tags, an array of tag complements, and instructions for carrying out the method.

It is a further object of the invention to provide methods and kits for carrying 
out the disclosed methods of physical mapping and generating gene knockouts. These 
methods and kits are based upon reagents and principles analogous to those used
for the massively parallel sequencing methods, as described below.

6. DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS 
6.1 DEFINITIONS 
A "sequence element" or "element" as used herein in reference to a 
polynucleotide is a number of contiguous bases or base pairs in the polynucleotide, up
to and including the complete polynucleotide. When referring to a sequence element
with a particular property, the sequence element consists of the bases or base pairs that
contribute to the property or are defined by the property.

The term "sample" as used herein refers to a polynucleotide or that element of 
a polynucleotide which will be analyzed for some property according to the method of 
this invention. For example, a sample polynucleotide may be joined to other sequence
elements to form a larger polynucleotide in order to practice the invention. The
element of the larger polynucleotide that is homologous to the sample polynucleotide 
is the "sample element" or "sample sequence element". 

A "sample tag" refers to a sequence element used to identify or distinguish 
different sample polynucleotides, sequence elements or clones present as members of
a collection. In general, an individual sample tag is joined to an individual
polynucleotide resulting in a collection of "sample-tagged" polynucleotides
comprising distinct sample tags. A sample-tagged polynucleotide may comprise one
or more distinct sample tags, which are used to distinguish different segments of the
polynucleotide. For example, sample tags may be present at the 5' and 3' ends of the
polynucleotide, or different tags may be distributed at multiple sites in the
polynucleotide. The same sample polynucleotide may be associated with more than
one sample tag, but to be informative, one sample tag must be associated with only
one sample polynucleotide in a collection. It is these informative associations that
constitute sample-tagged clones. Methods for designing sample tags are well known
in the art as exemplified by, e.g., Brenner (1997b). In some embodiments of the
invention, the sample tags may comprise individual synthetic oligonucleotides, each of
which has been ligated into a vector to provide a library or collection of vectors with
distinct sample tags, or the oligonucleotides are ligated directly to the polynucleotides
to be analyzed. In other embodiments, the sample tag may comprise part of the
sample sequence element.
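
The condition for an informative association (a sample polynucleotide may carry
several sample tags, but each sample tag must be associated with only one sample
polynucleotide in the collection) can be expressed as a simple check. The sketch
below is illustrative; the tag and sample names are hypothetical, and the
associations are assumed to be available as (sample tag, sample polynucleotide) pairs.

```python
from collections import defaultdict


def informative_tags(associations):
    """Map each informative sample tag to its single sample polynucleotide.

    associations is an iterable of (sample_tag, sample_polynucleotide) pairs.
    A tag seen with more than one sample polynucleotide is not informative
    and is left out of the result.
    """
    samples_by_tag = defaultdict(set)
    for tag, sample in associations:
        samples_by_tag[tag].add(sample)
    return {tag: next(iter(samples))
            for tag, samples in samples_by_tag.items()
            if len(samples) == 1}


# Hypothetical collection: insertX carries two tags (allowed),
# while tag T3 is shared by two different inserts (not informative).
associations = [
    ("T1", "insertX"),
    ("T2", "insertX"),
    ("T3", "insertY"),
    ("T3", "insertZ"),
]
print(informative_tags(associations))
```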

"Tagged" as used herein in reference to a polynucleotide means the 
polynucleotide is derived in one or more steps from a sample-tagged polynucleotide
by, for example, enzymatic, chemical or mechanical means, and the polynucleotide
comprises a tag. The "tag" is a sequence element that corresponds to a sample tag and
can be used to identify or distinguish the sample tag. Note that a sequence element is
itself a tag if it is derived from a tag and can be used to identify or distinguish the tag. In
many embodiments, the tag and the sample tag are identical. In certain embodiments,
the tag comprises the sample tag but contains additional sequence elements. The
additional sequence elements may be necessary, for example, to permit increased
hybridization temperatures or to impose structural constraints on the tag. In other
embodiments, the sample tag comprises the tag but contains additional sequence
elements. For example, two different sample tags that share the same tag may be
distinguished by preferential PCR amplification of the tag with primers that are
specific to only one tag. Subsequent removal of the priming sequences produces
identical tags that can be used to distinguish the different sample tags. During
amplification or another step in the invention, the tag could lose all sequence identity
with the sample tag. Nevertheless, as long as there exists an identifiable
correspondence between the two, information associated with the tag can be related to
the sample tag, which in turn can be related to the sample polynucleotide. The number
of distinct tags required to characterize a collection of sample-tagged polynucleotides
will vary. In some embodiments, a one-to-one relationship exists between the tag and
the sample tag. In other embodiments, the tags will identify information in addition to
the sample identity, for example the terminating nucleotide, the restriction site, etc.
Consequently, more distinct tags than distinct sample tags may be used. Finally, as
outlined above, the same tag may be used to identify more than one sample tag.

A "tag complement" as used herein refers to a molecule that will substantially 
hybridize to only one tag, or a set of distinguishable tags, among a collection of tags
under the appropriate conditions. Different tags that hybridize to the same tag
complement may be distinguished, for example, by different fluorophores, by their
ability to hybridize to a second oligonucleotide, etc. Some degree of
cross-hybridization by otherwise distinguishable tags can be tolerated, provided the signal
arising from hybridization between a tag A and its tag complement A' is discernable
from the cross-hybridization signal arising from hybridization between a different tag
B and the tag complement A'. In embodiments where the tag complement is a
polynucleotide or sequence element, preferably the tag is perfectly matched to the tag
complement. In embodiments where specific hybridization results in a triplex, the tag
may be selected to be either double stranded or single stranded. Thus, where triplexes
are formed, the term "complement" is meant to encompass either a double stranded
complement of a single stranded tag or a single stranded complement of a double
stranded tag. Tag complements need not be polynucleotides. For example, RNA and
single-stranded DNA are known to adopt sequence dependent conformations and will
specifically bind to polypeptides and other molecules (Gold et al., 1993 & 1995).

The terms "oligonucleotide" or "polynucleotide" as used herein include linear
oligomers of natural or modified monomers or linkages, including
deoxyribonucleosides, ribonucleosides, α-anomeric forms thereof, peptide nucleic
acids (PNAs), and the like, capable of specifically binding under the appropriate
conditions to a target polynucleotide by way of a regular pattern of
monomer-to-monomer interactions, such as Watson-Crick type of base pairing, base stacking,
Hoogsteen or reverse Hoogsteen types of base pairing, or the like. Usually monomers
are linked by phosphodiester bonds or analogs thereof to form "oligonucleotides"
ranging in size from a few monomeric units, e.g., 3-4, to several tens of monomeric
units, and "polynucleotides" are larger. However, the usage of the terms
"oligonucleotides" and "polynucleotides" in the art overlaps and varies. The terms are
used interchangeably herein. Whenever a polynucleotide is represented by a sequence
of letters, such as "ATGCCTG," it will be understood that the nucleotides are in
5'->3' order from left to right and that "A" denotes deoxyadenosine, "C" denotes
deoxycytidine, "G" denotes deoxyguanosine, and "T" denotes thymidine, unless
otherwise noted. Analogs of phosphodiester linkages include phosphorothioate,
phosphorodithioate, phosphoranilidate, phosphoramidate, and the like. It is clear to
those skilled in the art when polynucleotides having natural or non-natural nucleotides
may be employed. Polynucleotides or oligonucleotides can be single-stranded or
double-stranded.

As used herein, the term "polypeptide" is intended to include compounds
composed of amino acid residues linked by amide bonds. Although "protein" is often
used in reference to relatively large polypeptides, and "peptide" is often used in
reference to small polypeptides, usage of these terms in the art overlaps and varies.
The term "polypeptide" as used herein thus refers interchangeably to peptides,
polypeptides and proteins, unless otherwise noted or clear from the context. The term
"polypeptide" is further intended to encompass polypeptide analogues, polypeptide
derivatives and peptidomimetics that mimic the chemical structure of a polypeptide
composed of naturally-occurring amino acids. Thus a "polypeptide" encoded in a
polynucleotide is meant to include the polypeptide determined by the genetic code and
these synthetic mimics. Examples of polypeptide analogues include polypeptides
comprising one or more non-natural amino acids. Examples of polypeptide derivatives
include polypeptides in which an amino acid side chain, the polypeptide backbone, or
the amino- or carboxy-terminus has been derivatized (e.g., peptidic compounds with
methylated amide linkages). Examples of peptidomimetics include peptidic
compounds in which the polypeptide backbone is substituted with one or more
benzodiazepine molecules (see, e.g., James et al., 1993), "inverso" polypeptides in
which all L-amino acids are substituted with the corresponding D-amino acids,
"retro-inverso" polypeptides (see, e.g., Sisto et al., 1985) in which the sequence of amino
acids is reversed ("retro") and all L-amino acids are replaced with D-amino acids
("inverso") and other isosteres, such as polypeptide backbone (i.e., amide bond)
mimetics, including modifications of the amide nitrogen, the α-carbon, amide
carbonyl, complete replacement of the amide bond, extensions, deletions or backbone
crosslinks. Several peptide backbone modifications are known, including ψ[CH2S],
ψ[CH2NH], ψ[CSNH2], ψ[NHCO], ψ[COCH2], and ψ[(E) or (Z) CH=CH]. In the
nomenclature used above, ψ indicates the absence of an amide bond. The structure
that replaces the amide group is specified within the brackets. Other possible
modifications include an N-alkyl (or aryl) substitution (ψ[CONR]), backbone
crosslinking to construct lactams and other cyclic structures, and other derivatives
including C-terminal hydroxymethyl derivatives, O-modified derivatives and
N-terminally modified derivatives including substituted amides such as alkylamides and
hydrazides.

"Perfectly matched" or "perfectly complementary" in reference to a duplex 
means that the poly- or oligonucleotide strands making up the duplex form a double
stranded structure with one another such that every nucleotide in each strand
undergoes Watson-Crick base pairing with a nucleotide in the other strand. The term
also comprehends the pairing of nucleoside analogs, such as deoxyinosine,
nucleosides with 2-aminopurine bases, and the like, that may be employed. In
reference to a triplex, the term means that the triplex consists of a perfectly matched
duplex and a third strand in which every nucleotide undergoes Hoogsteen or reverse
Hoogsteen association with a base pair of the perfectly matched duplex. Conversely, a
"mismatch" in a duplex between a tag and an oligonucleotide means that a pair or
triplet of nucleotides in the duplex or triplex fails to undergo Watson-Crick and/or
Hoogsteen and/or reverse Hoogsteen bonding.
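
For a duplex formed from the four natural deoxynucleotides, the definition of
"perfectly matched" reduces to a reverse-complement test. The following sketch is
illustrative only. It assumes both strands are written in 5'->3' order, and it does
not model nucleoside analogs such as deoxyinosine or triplex (Hoogsteen) pairing;
the function names are hypothetical.

```python
# Watson-Crick partners for the four natural deoxynucleotides.
WATSON_CRICK = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the Watson-Crick complement of seq, written 5'->3'."""
    return "".join(WATSON_CRICK[base] for base in reversed(seq))


def is_perfectly_matched(strand, other):
    """True if every nucleotide in each strand pairs with one in the other.

    Both strands are given 5'->3'. Strands of unequal length cannot be
    perfectly matched, because some nucleotides would remain unpaired.
    """
    return len(strand) == len(other) and other == reverse_complement(strand)


print(reverse_complement("ATGCCTG"))               # CAGGCAT
print(is_perfectly_matched("ATGCCTG", "CAGGCAT"))  # True
print(is_perfectly_matched("ATGCCTG", "CAGGCAA"))  # False (one mismatch)
```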

As used herein, "nucleoside" and "nucleotide" include the natural nucleosides
and nucleotides, including 2'-deoxy and 2'-hydroxyl forms, e.g., as described in
Kornberg et al. (1992). "Natural nucleotide" as used herein refers to the four common
natural deoxynucleotides A, C, G, and T. "Analogs" in reference to nucleosides
includes synthetic nucleosides having modified base moieties and/or modified sugar
moieties, e.g., described by Scheit (1980); Uhlman et al. (1990), or the like, with the
only proviso that they are capable of specific hybridization. Such analogs include
synthetic nucleosides designed to enhance binding properties, reduce complexity of
probes, increase specificity, and the like.

As used herein, "nucleic acid sequencing reaction" refers to a reaction that, when
carried out on a polynucleotide clone, will produce a collection of polynucleotides of
differing chain length from which the sequence of the original nucleic acid can be
determined. The term encompasses, e.g., methods commonly referred to as "Sanger
Sequencing," which uses dideoxy chain terminators to produce the collection of
polynucleotides of differing length, and variants such as "Thermal Cycle Sequencing,"
"Solid Phase Sequencing," exonuclease methods, and methods that use chemical
cleavage to produce the collection of polynucleotides of differing length, such as
Maxam-Gilbert and phosphothioate sequencing. These methods are well known in the
art and are described in, e.g., Ausubel et al. (1997); Gish et al. (1988); Sorge et al.
(1989); Li et al. (1993); Porter et al. (1997). The term also includes methods based on
termination of RNA polymerase (e.g., Axelrod et al., 1978).

A "sequencing method" is a broad term that encompasses any reaction carried
out on a polynucleotide to determine some sequence from the polynucleotide. The
term encompasses nucleic acid sequencing reactions, sequencing by hybridization
(Southern, 1997; Drmanac et al., 1993; Khrapko et al., 1996; Fodor et al., 1999),
step-wise sequencing (e.g., Cheeseman, 1994; Rosenthal, 1993; Brenner, 1998a), etc.

A "sequence ladder" refers to a pattern of fragments from one clone resulting
from the size separation and visualization of reaction products produced by a "nucleic
acid sequencing reaction." Typically, size separation is accomplished by denaturing
gel electrophoresis. The nucleic acid sequence is ascertained by interpreting the
"sequence ladder" to determine the identity of the 3' terminal nucleotides of reaction
products that differ in length by one nucleotide. Generating and interpreting
"sequence ladders" is well within the skill in the art, and is described in, e.g., Ausubel
et al. (1997). A "band" in a sequence ladder refers to the clonal population of reaction
products that terminate at the same base and so migrate together through the
separation medium. A band will have width due to dispersion and diffusion, so it is
possible to speak of a part or portion of a band, which means a collection of the clonal
population that has migrated more closely together than some other collection.
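
Interpreting a sequence ladder amounts to ordering the bands by product length and
reading off the 3' terminal nucleotide of each. The sketch below is a simplified
illustration under stated assumptions: each band is represented as a hypothetical
(length, terminal base) pair, band width and compression artifacts are ignored, and
a gap in the lengths is treated as an unreadable ladder.

```python
def read_ladder(bands):
    """Read a sequence (5'->3') from the bands of one sequence ladder.

    bands is an iterable of (product_length, terminal_base) pairs for a
    single clone. Products must differ in length by exactly one nucleotide.
    """
    ordered = sorted(bands)  # shortest product first
    for (prev_len, _), (cur_len, _) in zip(ordered, ordered[1:]):
        if cur_len - prev_len != 1:
            raise ValueError(
                f"missing or duplicate band between {prev_len} and {cur_len}")
    # The 3' terminal base of each successive product gives the next base.
    return "".join(base for _, base in ordered)


# Hypothetical bands, listed out of order as they might be recorded.
bands = [(23, "G"), (21, "A"), (22, "T"), (24, "C")]
print(read_ladder(bands))  # ATGC
```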

A "primer" is a molecule that binds to a polynucleotide and enables a 
polymerase to begin synthesis of the daughter strand. For example, a primer can be a 
short oligonucleotide, a tRNA (e.g., Panet et al., 1975) or a polypeptide (e.g.,
Guggenheimer et al., 1984). A "primer binding site" is the sequence element to which
the primer binds. 

A "sequencing primer" is an oligonucleotide that is hybridized to a 
polynucleotide clone to prime a nucleic acid sequencing reaction. The sequencing 
primer is prepared separately, usually on a DNA synthesizer and then combined with 
the polynucleotide. A "sequencing primer binding site" is the sequence element to 
which the sequencing primer hybridizes. The sequencing primer binding sites in two 
different polynucleotides are considered to be the same when the same sequencing 
primer will efficiently prime the nucleic acid sequencing reaction for both 
polynucleotides. Of course, mispriming frequently occurs during sequencing 
reactions, but these artifactual priming sites are minor components of the sequencing 
reaction products. One skilled in the art will readily understand the difference between 
mispriming and efficient priming at the sequencing primer binding site. 

"Deconvolving" means separating data derived from a plurality of different 
polynucleotides into component parts, wherein each component represents data 
derived from one of the polynucleotides comprising the plurality. 
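
As a minimal sketch of deconvolving in this sense, pooled observations can be
grouped by tag so that each group holds the data from one polynucleotide. The data
format here is an assumption made for illustration: each observation is a
hypothetical (fraction index, tag, terminal base) tuple, with the fraction index
increasing with product length. The disclosure does not prescribe this format.

```python
from collections import defaultdict


def deconvolve(signals):
    """Split pooled signals into one ordered base string per tag.

    signals is an iterable of (fraction_index, tag, terminal_base) tuples
    obtained from a pool of tagged polynucleotides.
    """
    per_tag = defaultdict(dict)
    for fraction, tag, base in signals:
        per_tag[tag][fraction] = base
    # Order each component by fraction index, independently of the others.
    return {tag: "".join(bases[f] for f in sorted(bases))
            for tag, bases in per_tag.items()}


# Hypothetical pooled data from two tagged templates, interleaved.
signals = [
    (2, "tag1", "C"), (1, "tag2", "T"), (1, "tag1", "A"),
    (3, "tag2", "A"), (2, "tag2", "T"), (3, "tag1", "G"),
]
print(deconvolve(signals))
```

Because each component is assembled independently, the number of tags in the pool
does not change the logic, only the amount of data to be grouped.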

An "array" refers to a solid support that provides a plurality of spatially 
addressable locations, referred to herein as features, at which molecules may be 
bound. The number of different kinds of molecules bound at one feature is small 
relative to the total number of different kinds of molecules in the array. In many 
embodiments, only one kind of molecule (e.g., oligonucleotide) is bound at each
feature. Similarly, "to array" a collection of molecules means to form an array of the 
molecules. 

"Spatially addressable" means that the location of a molecule bound to the 
array can be recorded and tracked throughout any of the procedures carried out 
according to the method of the invention. 

A "contig" means a group of clones that represent overlapping regions of a 
genome. 

A "contig map" means a map depicting the relative order of a linked library of 
small overlapping clones representing a complete chromosomal segment. 

A "library" refers to a collection of polynucleotides. A particular library might 
include, for example, clones of all of the DNA sequences expressed in a certain kind 
of cell, or in a certain organ of the body, or a collection of man-made polynucleotides, 
or a collection of polynucleotides comprising combinations of naturally-occurring and 
man-made sequences. Polynucleotides in the library may be spatially separated, for 
example, one clone per well of a microtiter plate, or the library may comprise a pool of
polynucleotides or clones. When a reaction is performed on a spatially separated 
library, the same reaction by definition must be performed separately on every 
member of the library. When a reaction is performed on a pooled library, the reaction 
need only be performed once. 

"Physical mapping" broadly refers to determining the locations of two or more 
landmarks in a polynucleotide segment. The term is meant to distinguish genetic 
mapping methods, which rely on a determination of recombination frequencies to 
estimate distance between two or more landmarks, from the methods of the present 
invention, which determine the actual linear distance between landmarks. Similarly, a 
"physical map" is the product of physical mapping. 

"Landmark" broadly refers to any distinguishable feature in a polynucleotide 
other than an unmodified nucleotide. Landmarks include, by way of example, 
restriction sites, single nucleotide polymorphisms, short sequence elements 
recognized by nucleic acid binding molecules, DNase hypersensitive sites, 
methylation sites, transposons, etc. This definition is meant to distinguish physical
mapping from "sequencing", which refers to determining the linear order of
nucleotides in a polynucleotide.

"Fingerprinting" refers to the use of physical mapping data to determine which 
nucleic acid fragments have a specific sequence (fingerprint) in common and therefore 
overlap. 

"Cloning" as used herein in reference to a polynucleotide refers to any method
used to replicate a polynucleotide segment. The term encompasses cloning in vivo,
which makes use of a cloning vector to carry inserts of the polynucleotide segment of
interest, and what I refer to as cloning in vitro in which one or both strands of a
polynucleotide segment of interest is replicated without the use of a vector. Cloning
in vitro encompasses, for example, replication of a polynucleotide segment using
PCR, linear amplification using a primer that recognizes a portion of the
polynucleotide segment in conjunction with an enzyme capable of replicating the
polynucleotide, in-vitro transcription, rolling circle replication, etc. Similarly, a
"clone" in reference to a polynucleotide means a polynucleotide that has been
replicated to produce a population of polynucleotides or sequence elements that share
identical or substantially identical sequence. Substantial identity encompasses
variations in the sequence of a polynucleotide that sometimes are introduced during
PCR or other replication methods. This notion of substantial identity is well
understood by those skilled in the art and it applies whenever the identity of
polynucleotides is at issue.

"Hybridization" as used herein refers to a sequence dependent binding 
interaction between at least one strand of a polynucleotide and another molecule. 
From the context, it is obvious to one skilled in the art whether a double-stranded 
polynucleotide must be denatured before the binding event. For example, the term 
includes Watson-Crick type base pairing, Hoogsteen and reverse Hoogsteen bonding,
binding of an aptamer to its cognate molecule, etc. "Cross-hybridization" occurs when
two distinct polynucleotides can bind to the same molecule or two distinct molecules
can bind to the same polynucleotide. In general, cross-hybridization depends on the
collection of polynucleotides (or molecules) since two polynucleotides (or molecules)
cannot cross-hybridize if they are not in the same collection. Hybridization and cross-
hybridization also may be used in reference to sequence elements. For example, two
distinct polynucleotides may contain identical sample tags. The polynucleotides cross-
hybridize to the tag complement whereas the tags, being identical, do not cross-
hybridize.

A "common sequence" or "common sequence element" refers to a sequence or 
sequence element that is or is intended to be present in every member of a collection 
of polynucleotides.

The term "distinct" as used herein in reference to polynucleotides or sequence 
elements means that the sequences of the polynucleotides or sequence elements are 
not identical.

A "pool" is a group of different molecules or objects that is combined together 
so that they are not isolated from one another and any operation performed on one
member of the pool is by necessity performed on many members of the pool. For
example, a pool of polynucleotides in solution is simply a plurality of different
polynucleotides or clones mixed together in one solution; or each clone may be
attached to a solid support, for example an array or a bead, in which case the pool
consists of the clones combined together in one solution (e.g., the same fluid
container). Similarly, "to pool" means to form a pool. 

An "aliquot" is a subdivision of a sample such that the composition of the 
aliquot is essentially identical to the composition of the sample. 

The term "to derive" as used herein in reference to polynucleotides means to 
generate one polynucleotide from another by any process, for example enzymatic,
chemical or mechanical. The generated polynucleotide is "derived" from the other 
polynucleotide. 

The term "amplify" in reference to a polynucleotide means to use any method 
to produce multiple copies of a polynucleotide segment, called the "amplicon", by
replicating a sequence element from the polynucleotide or by deriving a second
polynucleotide from the first polynucleotide and replicating a sequence element from 
the second polynucleotide. The copies of the amplicon may exist as separate 
polynucleotides or one polynucleotide may comprise several copies of the amplicon. 
A polynucleotide may be amplified by, for example, a polymerase chain reaction, in
vitro transcription, rolling-circle replication, in vivo replication, etc. Frequently, the
term "amplify" is used in reference to a sequence element in the amplicon. For 
example, one may refer to amplifying the tag in a polynucleotide by which is meant
amplifying the polynucleotide to produce an amplicon comprising the tag sequence 
element. The precise usage of amplify is clear from the context to one skilled in the 
art. 

The term "cleave" as used herein in reference to a polynucleotide means to 
perform a process that produces a smaller fragment of the polynucleotide. If the
polynucleotide is double-stranded, only one of the strands may contribute to the
smaller fragment. For example, physical shearing, endonucleases, exonucleases,
polymerases, recombinases, topoisomerases, etc., will cleave a polynucleotide under
the appropriate conditions. A "cleavage reaction" is the process by which a
polynucleotide is cleaved.

A "mapping reaction" as used herein refers to any reaction that can be carried 
out on a polynucleotide clone to generate a physical map or a nucleotide sequence of 
the clone. Similarly, a "map" is a physical map or a nucleotide sequence. 

The term "associating" as used herein in reference to a tagged polynucleotide 
with a property and a tag complement means determining that the polynucleotide
hybridizes to the tag complement. In many embodiments, associating simply means
hybridizing a polynucleotide with a known property to a tag complement and
detecting the hybridization. In other embodiments, associating means detecting a
property of a polynucleotide that is already hybridized to a tag complement. In both
cases, the result is information that the polynucleotide has a certain property and in
addition hybridizes to the tag complement. The properties of a polynucleotide can 
include for example the length, terminal base, terminal landmark or other properties 
according to this invention. 

A "junction" as used herein in reference to insertion elements is the DNA that 
flanks one side of the insertion element.

A "clonal population" as used herein in reference to cells is a collection of 
cells that are substantially identical and originated from a single, isolated cell. 

An "array sequencing reaction" is any method that is used to determine 
sequences from a plurality of polynucleotides in an array, for example methods
described by Brenner (1997c and 1998a), Brenner et al. (1998a), Cheeseman (1994),
Drmanac et al. (1993), Pastinen et al. (1997), Dubiley et al. (1997), Graber et al.
(1999), etc.

A "bioactive" compound is any compound, either man-made or natural, that
has an observable effect on a cell or organism. The observable effect is the "biological
activity" of the compound.

A "daughter cell" as used herein in reference to a first cell is any descendent 
cell resulting from replication of the DNA of the first cell. The DNA of the daughter 
cell may not be identical to the first cell. For example, the daughter cell may result 
from mating two different cells (or organisms); additional DNA (for example 
transgenic DNA) or deletions may be present in the daughter cell; or the genome of 
the daughter cell may result from targeted homologous recombination, etc. 
6.2 Massively-parallel sequencing methods 
A collection of sample-tagged clones is prepared by joining a set of sample 
polynucleotides with a set of sample tags so that many of the sample tags (i.e.,
preferably, at least approximately 35% of the total) are associated with unique sample
polynucleotides. A preferred sample tag, as shown in Figure 1a, comprises a distinct
sequence element 12 flanked on both sides by common regions 10 & 14 shared by the
other clones. The sample sequence element 16 comprises the sample polynucleotide
that is joined to the sample tag. A nucleic acid sequencing reaction is performed on
the pooled collection of sample-tagged clones (e.g., Sanger chain-termination method,
Maxam & Gilbert chemical cleavage method, etc.). Typically, four separate reactions
are performed, which correspond to the four (A, T, G, C) nucleotides. The Sanger 
method employs the sequencing primer 18, which hybridizes to the sequencing primer 
binding site in common region 10. In this example, only one sequencing primer 
binding site is needed for the sequencing reaction to be performed on the pool of 
sample-tagged clones. Of course, different collections of clones with different 
common regions comprising different sequencing primer binding sites may be pooled 
and more than one primer may be utilized, but preferably there will be many more 
sample-tagged clones than sequencing primer binding sites utilized in the sequencing 
reaction. One or a limited number of primer binding sites means only a small number 
of sequencing primers are required for the sequencing reaction, which produces 
efficient priming and limits spurious priming artifacts. 

The products of the sequencing reactions are separated by size and four sets of 
fractions are collected. Any method of separation may be used that sufficiently
resolves the sequencing fragments (i.e., single nucleotide resolution) and permits
collection of the fragments in a state compatible with subsequent analysis (i.e.,
amplification and/or hybridization, see below). Representative methods include
polyacrylamide gel electrophoresis, capillary electrophoresis, chromatography, etc.
These methods are well known in the art and are described in, e.g., Ausubel et al.
(1997), Landers (1996), and Thayer et al. (1996). Fractions may be collected, for
example, by running the sequencing reactions off the bottom of a gel or column, or
each lane of a gel may be sectioned in the direction normal to the direction of
electrophoresis (i.e., transversely) and nucleic acid eluted from the sections. Ideally,
each fraction corresponds to chain lengths that differ by one nucleotide and any one
band is completely contained in only one fraction. Different clones, however, will
display slight variations in band migrations, so fractions may contain only part of a
band. Each fraction of terminated DNA fragments is made double-stranded using
primer 20 in Figure 1a, which hybridizes to common region 14. The fractions are
amplified to produce tagged amplicons comprising distinct sequence element 12. A
preferred method of amplification is PCR with primer 18 and primer 20. Other
methods of amplification are applicable as described below. The four sets of
amplicons can be marked with four different labels (e.g., four different fluorophores),
where each label corresponds to one of the terminating nucleotides.

The amplicons are separately hybridized to an array of tag complements
wherein for example each feature consists of oligonucleotides complementary to only
one distinct sequence element 12. Alternatively, groups of four fractions that are
marked with different labels and correspond to the same size (i.e., the same distance or
time of migration), are pooled and hybridized to the array. For each tag, the
hybridization patterns and fraction numbers will identify the sequence of nucleotides
in the polynucleotide joined to that tag. That is, the sequencing ladder for the sample-
tagged polynucleotide can be reconstructed from the hybridization data and fraction
numbers for the associated tag. The resolution of the sequencing ladders will improve
with more fractions per band (i.e., smaller fraction size), but the tradeoff is that more
hybridizations are needed to reconstruct the ladders. Obviously, hybridization
conditions and tag sequences must be chosen to minimize cross-hybridization between
different tags. Methods for designing sample tags are well known, see for example,
Brenner (1997b). In this way, an array of oligos can be used to deconvolute
sequences of the sample-tagged polynucleotides. 
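
The deconvolution step can be illustrated with a short Python sketch. It is only
an idealized illustration, not the method as practiced: it assumes that each
fraction holds exactly one chain length, that the array readout has already been
reduced to a single terminating base per tag per fraction (None where no call
can be made), and the names reconstruct_ladders and readout are hypothetical.

```python
from typing import Dict, List, Optional


def reconstruct_ladders(
    readout: Dict[int, Dict[str, Optional[str]]]
) -> Dict[str, str]:
    """Rebuild one base string per tag from per-fraction array calls.

    readout maps fraction number -> {tag: terminating base, or None}.
    Fractions are assumed to be numbered in order of increasing chain length.
    """
    # Collect every tag seen in any fraction so positions stay aligned
    # even when a tag is missing from one fraction's readout.
    tags = sorted({tag for calls in readout.values() for tag in calls})
    ladders: Dict[str, List[str]] = {tag: [] for tag in tags}
    for fraction in sorted(readout):
        calls = readout[fraction]
        for tag in tags:
            base = calls.get(tag)
            # "N" marks a fraction in which no single base could be called.
            ladders[tag].append(base if base else "N")
    return {tag: "".join(bases) for tag, bases in ladders.items()}


if __name__ == "__main__":
    readout = {
        1: {"tag1": "A", "tag2": "G"},
        2: {"tag1": "C", "tag2": "G"},
        3: {"tag1": "T", "tag2": None},
    }
    print(reconstruct_ladders(readout))
```

In real data a band may be split across neighboring fractions, so a base call
per fraction would have to come from signal intensities rather than from a
simple lookup as assumed here.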

A slight variation of the above technique will permit the four sequencing 
reactions to be run together in one lane. Thus, problems associated with lane-to-lane 
variation during gel electrophoresis are eliminated. The sequencing primer 18 can 
tolerate additional sequences at its 5'-end without influencing priming. Consider four 
different sequences of identical length (40, 42, 44 and 46) added to the 5'-end of 
primer 18 to make sequencing primers 30A, 30B, 30C and 30D as shown in Figure
1b. The additional sequences are long enough so that their complements can act as
primer binding sites for primers 50, 52, 54 and 56. Preferably the melting
temperatures of these four primers are similar. Now, the four chain-terminating
reactions are performed with the four different sequencing primers (i.e., the ddATP
reaction is primed by 30A, ddGTP is primed by 30B, etc.). The four reactions are
pooled, separated by size and fractionated. Each fraction can be PCR amplified using
five primers: 20, 50, 52, 54, and 56. Primers 50, 52, 54 and 56 are attached to
different labels (e.g., different fluorophores). In this way, the tag is labeled according
to the dideoxy terminator. Alternatively, each fraction can be amplified in four
separate reactions with four primer pairs: 20 + 50, 20 + 52, 20 + 54, and 20 + 56, in
which case the label need not be attached to the oligo. If RNA polymerase is used to
amplify the tags, then sequences 40, 42, 44 and 46 may encode four separate RNA
polymerase promoters (e.g., T3, T7, SP6 & E. coli RNA polymerase). Alternatively,
the four sequencing primers (30A, 30B, 30C and 30D) may comprise a single 
promoter located 5-prime to regions 40, 42, 44 and 46. In the latter case, the tags may 
be visualized after hybridization to the array in a subsequent hybridization with 
labeled oligonucleotides 50, 52, 54 and 56, wherein the oligonucleotides preferably 
comprise different labels. If PCR is used to amplify the tags, it is advantageous to 
attach a biotin (or analogous) group to primer 20 (or the primers opposite primer 20) 
so the complementary (and in some cases unlabeled) strand can be easily removed 
before hybridization to the array. Other methods that preferentially degrade one 
strand can also be employed (e.g., 5'-phosphate plus lambda exonuclease, see 
Ausubel, 1997). Not all sequences of equal length will work equally well for the 
sequence elements 40, 42, 44 and 46. Preferably, the sequencing primers 30A, 30B,
30C and 30D have minimal secondary structure so their contribution to the mobility
of the reaction products during separation is based essentially entirely on length, that 
is the different primers contribute equally to mobility. 

An analogous strategy can be employed to permit pooling of nucleic acid
sequencing products generated by the Maxam-Gilbert method (and other chemical
cleavage methods). In this case, sequence elements that correspond to 40, 42, 44 and
46 are ligated as adapters to the sample-tagged polynucleotides before or after the
reactions, but prior to pooling and separation. That is, the adapter comprising
sequence 40 is ligated to polynucleotides subjected to the "A + G" reaction, the
adapter comprising sequence 42 is ligated to polynucleotides subjected to the "G"
reaction, etc.

Clearly, many different nucleic acid sequencing reactions can be used to 
practice this invention. The Sanger and Maxam-Gilbert methods are outlined above, 
but several variations are well known in the art. 

By including polynucleotides of known sequence attached to known sample
tags as internal controls in the pool of sample-tagged polynucleotides, it is possible to
determine the fraction number of any fraction based on the known sequence
information of the controls. That is, the control sequence patterns and signal
intensities can be used to calibrate the hybridization patterns from the array to
facilitate reconstruction of the unknown sequencing ladders. Any variation in signal
intensities from one hybridization to the next can be calculated and corrected by
referring to the control sequence patterns. If 10 known fragments are included in the
pool, then each fraction will show only one of $4^{10} \approx$ one million possible hybridization
patterns at the corresponding 10 locations on the array. In practice, this number is
much greater because the hybridization signals will not have simple binary
contributions from each base, but will display variable intensity depending on the 
amount of each band in the fraction. 
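
The use of internal controls to identify a fraction can be sketched as follows.
This is a hedged illustration only: it assumes idealized fractions holding one
chain length each and a clean single-base call for every control, and the
function name control_patterns and the three short control sequences are
hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def control_patterns(controls: Sequence[str]) -> Dict[Tuple[str, ...], List[int]]:
    """Map each pattern of terminal bases to the chain position(s) giving it.

    controls holds the known sequences of the internal controls, all read
    from the same sequencing primer. Position 1 is the first base added.
    """
    length = min(len(c) for c in controls)
    table: Dict[Tuple[str, ...], List[int]] = {}
    for pos in range(length):
        # The bases that the controls show in the fraction for this position.
        pattern = tuple(c[pos] for c in controls)
        table.setdefault(pattern, []).append(pos + 1)
    return table


if __name__ == "__main__":
    controls = ["ACGT", "CCTA", "GATT"]
    table = control_patterns(controls)
    # Chain position(s) whose fraction would show G, T, T at the three
    # control features of the array.
    print(table[("G", "T", "T")])
    # Number of distinct binary-call patterns available with 10 controls.
    print(4 ** 10)
```

A pattern that maps to more than one position is ambiguous on its own, which
is one reason to include enough controls that most patterns are unique over
the read length.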

Denaturing polyacrylamide gels separate DNA principally by size. There are
well known exceptions to this rule (e.g., compressions) that can affect DNA
migration. In addition, there is a very slight dependence of mobility on the terminal
3'-base. Therefore, sequencing bands corresponding to equivalent sizes need not be
perfectly superimposed. The problem of reconstructing the DNA ladder from the
band intensities in each fraction becomes a problem of reconstructing a wave from
sampled intervals along that wave (picture the readout from an ABI sequencer and this
problem becomes clear). Obviously, the more fractions that one collects, the more
information one has to reconstruct the parent wave. This is simply a problem in
information theory, well known in the art; see for example Stockham et al. (1993),
Allison et al. (1998), Fujiwara et al. (1982), Johnson et al. (1994), and Press et al.
(1988). By calibrating the fractionation apparatus and/or using internal standards, the
appropriate sampling frequency is determined. The hybridization data provides
information about the "amplitude" of each peak. By optimizing the gel conditions and
stressing uniform band intensities and uniform spacing, it may be possible to obtain
unambiguous sequence data with fewer fractions than bases. Very few template
preparations and sequencing reactions are needed to obtain enormous amounts of
sequence information so even elaborate protocols (e.g., cesium banding, formamide
gels, etc.) and a variety of nucleotide analogs (e.g., 7-deaza-dGTP and dITP) can be
used to produce optimally fractionated sequencing products.

An important aspect of the method is the ability to construct the clone
libraries in a pool of sample-tagged vectors (alternatively, the sample tags can be
added as adapters and the clones ligated into the same vector, see Sagner et al., 1998).
This approach greatly reduces the effort involved in library construction, but comes at
a cost of lost information per pool. For example, consider a library that consists of
500,000 different sample tags. The effort would be enormous to make 500,000
separate libraries and then to pick a single clone from each library. Instead, one
library is constructed and about 500,000 transformants are pooled (a very trivial
operation). Similarly, the library may be constructed entirely in vitro, and 500,000
clones may be selected by amplifying in vitro a proper dilution of the library so that
the amplicons comprise about 500,000 clones. However, assuming a Poisson
distribution, only a fraction (1/e = 0.37) of the sample tags is expected to be present
only once in the library (i.e., attached to only one sample polynucleotide). Another 37%
of the sample tags are expected to be absent from the collection and the remainder will
be present two or more times (that is, two or more different polynucleotide clones will
contain the same sample tag). Therefore, 63% of the original sample tags will provide
no or garbled information. Those original sample tags providing garbled information
are readily recognized because more than a single base is identified at each position
during the deconvolution step. This loss is well worth the savings in effort. Certain
strategies can be used to increase the information content, such as using 5 million
original sample tags and selecting only 500,000 clones, but if the maximum size of the
array is 500,000 (close to the current Affymetrix array size), then either 10 arrays must
be used per hybridization to extract about 90% of the information, or the subset of
tags must be determined first and a new array synthesized that contains the 450,000
unique sample tags. It may be possible to enrich for unique clones by sequentially
hybridizing to the array plasmid DNA from smaller subsets of transformants (say
50,000). In this way, a tag complement in the array becomes saturated and cannot
hybridize to other plasmids that are present in subsequent pools. Plasmid DNA can be
eluted from the array, transformed back into E. coli (or amplified in some other way)
and sequenced. Of course, if smaller numbers of clones are to be sequenced, it is
feasible to construct separate libraries for each tag and pool one member from each
library before performing the sequencing reaction, or even separately perform the
sequencing reaction on one member from each library and then pool the reaction
products.

Another important aspect of the invention is the ability to amplify the DNA in
the fractions. The current limit of detection on an Affymetrix chip is 0.5 pM probe
(see Lockhart et al., 1996). Assuming a 200 µl hybridization volume, this equals
$(0.5 \times 10^{-12}\,\mathrm{M}) \times (200 \times 10^{-6}\,\mathrm{L}) \times (6 \times 10^{23}\,\text{molecules/mole}) = 6 \times 10^{7}$ molecules. Assuming
1 µg of 3 kb plasmid pool is sequenced, then
$(10^{-6}\,\mathrm{g} / (3000 \times 625\,\text{M.W.})) \times (6 \times 10^{23}\,\text{molecules/mole}) = 3.2 \times 10^{11}$
molecules are divided among 500,000 different plasmids
and 1000 different bands. Therefore $(3.2 \times 10^{11}) / (500{,}000 \times 1000) = 640$ molecules of
any one tag are expected to be present in any one band. In this case, an amplification
factor of $6 \times 10^{7} / 640 \approx 100{,}000$ is required. There are multiple strategies for converting
the terminated sequencing fragments into a form compatible with in vitro
amplification. A number of well-known methods exist for converting single-stranded
DNA into a double-stranded form. For example, random priming can be performed
with a mixture of oligonucleotides that are identical except for several random bases
near the 3'-ends. This method is well known in the art; see, e.g., Telenius et al.
(1992) and Cheung et al. (1996). PCR then can be performed with primer 18 and a
second oligonucleotide primer that is identical to the region shared by the random
primers. Other forms of in vitro amplification are possible, such as linear 
amplification with RNA polymerase (assuming the double-stranded fragment contains 
a promoter), etc. Variations of this strategy include the ligation of short double-
stranded molecules (adapters) to randomly primed sequencing fragments. The second
primer is designed to anneal to the adapter sequence (see section 6.7). 
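
The two numerical estimates given above (how many sample tags end up absent,
unique or shared in a pooled library, and how much amplification each fraction
needs) can be checked with a short Python sketch. It is a rough check under
stated assumptions: the number of clones per tag is treated as Poisson-
distributed, the constants are the rounded values quoted in the text, and the
function names tag_occupancy and amplification_factor are hypothetical.

```python
import math
from typing import Dict


def tag_occupancy(n_tags: int, n_clones: int) -> Dict[str, float]:
    """Poisson estimate of the share of tags that are absent, unique or shared."""
    lam = n_clones / n_tags              # mean number of clones per tag
    p_absent = math.exp(-lam)            # tag carried by no clone
    p_unique = lam * math.exp(-lam)      # tag carried by exactly one clone
    return {
        "absent": p_absent,
        "unique": p_unique,
        "shared": 1.0 - p_absent - p_unique,
        "unique_tags": n_tags * p_unique,
    }


def amplification_factor(
    detection_molar: float = 0.5e-12,    # 0.5 pM detection limit
    volume_l: float = 200e-6,            # 200 microliter hybridization volume
    dna_g: float = 1e-6,                 # 1 microgram of plasmid pool
    plasmid_bp: int = 3000,              # 3 kb plasmid
    mw_per_bp: float = 625.0,            # molecular weight per base pair (text value)
    n_plasmids: int = 500_000,
    n_bands: int = 1000,
    avogadro: float = 6e23,              # rounded as in the text
) -> float:
    """Fold amplification needed for one tag in one band to reach detection."""
    needed = detection_molar * volume_l * avogadro
    total = dna_g / (plasmid_bp * mw_per_bp) * avogadro
    per_band = total / (n_plasmids * n_bands)
    return needed / per_band


if __name__ == "__main__":
    print(tag_occupancy(500_000, 500_000))
    print(tag_occupancy(5_000_000, 500_000)["unique_tags"])
    print(amplification_factor())
```

With these inputs the sketch gives about 37% absent, 37% unique and 26% shared
tags for 500,000 clones on 500,000 tags, roughly 450,000 unique tags for
500,000 clones on 5 million tags, and an amplification factor a little under
100,000, in line with the figures in the text.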

Other strategies for amplifying the tag sequences may require removal of any
unusual bases at the 3'-ends of the sequencing fragments (e.g., dideoxynucleotides).
This step can be performed by limited digestion of the fragments with a 3'-
exonuclease (e.g., Exonuclease I, T4 DNA Polymerase, etc.). Now, the linear
fragments can be tailed with terminal transferase or even joined to another single-
stranded fragment of known sequence through the action of T4 RNA ligase. In both
cases, the known sequence (i.e., polyA or the second fragment) can serve as a second
priming site for PCR or other form of amplification. Alternatively, the digested
sequencing fragments can be circularized with T4 RNA ligase. Inverse PCR can be
performed on this circular substrate (Innis et al., 1990). The circles can also be
amplified by a rolling-circle type amplification with a strand-displacing polymerase as
disclosed in Lizardi et al. (1998) and Zhang et al. (1998).

It is possible in some embodiments to perform the nucleic acid amplification 
step after the sequencing fragments are hybridized to the oligonucleotide array.
Adams et al. (1997) describe a method in which both PCR primers are attached to a
solid substrate. Amplification occurs in a fashion similar to traditional PCR, only the
replicated molecules remain attached to the substrate. In this case, each feature (spot)
in the array will contain one oligonucleotide that hybridizes to a particular tag and one
common primer (e.g., primer 18, Fig. 1a). Note, the sequencing fragments are not
complementary to the common primer until the complementary strand is synthesized. 

A more preferred method of "in-situ" amplification is the rolling-circle type
process mentioned above. In this case, the sequencing fragments can be converted to
a circular form before hybridization to the array. The oligonucleotides in the array
complementary to the tags will prime the rolling-circle replication. A second common
primer can be provided in solution. Other variations of rolling circle amplification
may be used. For example, consider a tag complement-sequencing fragment duplex.
The sequence upstream of the tag, including the sequencing primer, will be present as
a 5'-single-stranded extension of the duplex. A second oligonucleotide can hybridize
to the overhang. T4 DNA ligase can join the tag complement to this second
oligonucleotide. The second oligonucleotide can then serve as a primer for rolling
circle amplification. In this case, the circular substrate is a common molecule that is
amplified wherever in the array hybridization of the sequencing fragments has
occurred. For a discussion of rolling circle amplification see Lizardi et al. (1998) and
Zhang et al. (1998).

The preferred sample tag shown in Figure 1a is joined to the sample
polynucleotide in vitro (i.e., it is an "adapter tag"). In certain instances, the sample tag
may be a "genomic tag" that comprises a sequence element from the sample 
polynucleotide. For example, consider an array made by separately PCR-amplifying 
individual clones from a library (for example, cDNA clones) and spotting the clones 
on a glass slide (see for example Brown et al., 1998). All the clones are amplified
with the same two vector primers. The PCR amplicons may be pooled and sequenced 
as follows. In this example, the pool of cDNA clones is sequenced with one of the 
PCR primers by the "Sanger" method. The reaction products are separated and 
fractionated as described above. Now, the fractionated products are amplified in vitro 
to generate amplicons comprising sequences from the cDNA (see section 6.7.1.2). 
These amplicons are hybridized to the spotted array to reconstruct the sequence 
ladders from individual clones as described above. Note, the cDNA clones comprise 
the tag complement sequences. Arrays of the cDNA clones constructed by other 
methods also are suitable (see section 6.8 below). Of course, the common sequences 
shared by all the clones should be removed from the amplicons prior to hybridization 
(or removed from the cDNA clones prior to spotting) to minimize cross-hybridization. 
This step is trivial if a restriction site separates the sample sequence elements from the 
common elements (i.e., the cDNA clones were ligated into a cloning vector at a
restriction site). This method of sequencing with genomic tags is preferred when a
library cannot be easily remade or "retrofit" with the adapter tags shown in Figure 1a.
6.3 Massively-parallel physical mapping methods 
The sequencing method described above is a parallel method for fragmenting a 
polynucleotide at its bases, determining the size of each fragment and thereby
determining the linear order of the bases. However, a polynucleotide can be
fragmented at features other than single bases. These features, or landmarks, include
for example restriction sites, DNase hypersensitive sites, recognition sites for DNA
binding proteins, methylation sites or indeed any region of DNA that can be
preferentially nicked or cut or otherwise used to define the length of a polynucleotide
fragment. For example, the lac repressor binding site can be used as a landmark for
directly cutting the DNA with a lac repressor coupled to EDTA-Fe (Shin et al., 1991).
This site can be used in an Achilles' heel type cleavage reaction (e.g., Koob et al.,
1990), or the lac repressor can be used to prevent an exonuclease from degrading the
polynucleotide beyond the site (see Johnson et al., 1990).

In a manner analogous to the parallel sequencing method, it is possible to 
determine the locations of landmarks in a polynucleotide. In essence, a nucleic acid 
sequencing reaction is a partial "cleavage" reaction of a polynucleotide clone at its 
nucleotides. The construction of a physical map involves the partial "cleavage" of a 
polynucleotide clone at its landmarks. The use of sample tags, fractionation and array
hybridization to reconstruct the pattern or "ladder" of landmarks from many different 
polynucleotides is identical in many respects to the sequencing method. 

A preferred landmark is the restriction site. Indeed, the classic notion of
physical mapping is restriction mapping. Larger "contigs" are constructed from
polynucleotides by comparing their distribution of restriction sites to look for overlaps
(e.g., Kohara et al., 1987). The physical map of an entire genome may be constructed
by determining the restriction maps of subclones in a massively parallel manner
according to the method of this invention. The use of restriction sites is representative
and may be substituted by other landmarks.

To construct the physical (restriction) map of a genome, genomic DNA is
fragmented and subcloned to form a library. Of course, the method is applicable to
any portion of a genome from which nucleic acid can be prepared, such as, e.g., a
chromosome or a portion thereof. It is within the skill of the art to isolate a portion of
a genome by, e.g., flow cytometry and to prepare a library of genomic DNA from it.
In fact, genomic DNA libraries derived from single human chromosomes have been
constructed by, e.g., the United States government's National Laboratories, and such
libraries are readily available. See, e.g., Birren et al. (1996) and Kim et al. (1994).

The library is constructed de novo in sample-tagged vectors or an existing
library can be "retrofit" with sample tags (e.g., Frengen et al., 1999) so that many of
the sample tags (i.e., preferably at least 35% of the total) have a unique
correspondence to only one sample polynucleotide. That is, a sample tag is joined to
only one sample polynucleotide (though one sample polynucleotide can be joined to
more than one sample tag). The clones are pooled and cut to completion with a
restriction enzyme that cuts in the vector. Typically, this enzyme will be a "rare
cutter," that is, it cuts infrequently (e.g., recognizes an 8 base-pair (or longer)
sequence). A partial digestion is performed on the pooled clones with another
restriction enzyme. The digestion products are separated by size (e.g., by gel
electrophoresis, chromatography, etc.), and fractions are collected. Each fraction
includes a narrow size distribution of fragments. The fractions can be hybridized
directly to an array of tag complements, or preferably tagged amplicons may be
amplified from the fractionated DNA before hybridization (e.g., using PCR, RNA
polymerase, etc. as described supra). In a preferred embodiment, the sample tags are
flanked by sequences common to all the clones as in Figure 1a. The tags can be
labeled (e.g., using fluorescent dyes or other methods known in the art) before or after
electrophoresis or during amplification using standard techniques (see above and
Kohara et al., 1987). As described above, certain in-situ amplification protocols may
be appropriate.

The restriction digest pattern can be reconstructed for any clone by observing 
the fractions that contain the tag or tagged amplicon that corresponds to the clone. 
This process can be repeated with several different restriction enzymes. The resulting 
partial digest patterns provide a "fingerprint" of every clone in the pool. Identical 
fingerprint patterns in a region indicate two clones overlap as disclosed in Kohara et
al. (1987). Note any polynucleotide cleaved with a restriction enzyme will produce at
least two fragments, but only the tagged fragment will be visualized. 
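
By way of illustration only, the reconstruction step described above can be sketched in Python. The input format is an assumption made for this sketch and is not part of the disclosure: each size fraction is summarized as the set of sample tags that gave a hybridization signal, and `fraction_sizes` holds an approximate fragment size (in base pairs) for each fraction. All function and variable names are hypothetical.

```python
from collections import defaultdict

def reconstruct_fingerprints(fraction_tags, fraction_sizes):
    """Map each sample tag to the sorted fragment sizes at which it was detected.

    fraction_tags  -- list of sets; fraction_tags[i] holds the tags detected
                      in size fraction i (assumed input format)
    fraction_sizes -- list of approximate fragment sizes in bp, one per fraction
    """
    fingerprints = defaultdict(list)
    for tags, size in zip(fraction_tags, fraction_sizes):
        for tag in tags:
            fingerprints[tag].append(size)
    return {tag: sorted(sizes) for tag, sizes in fingerprints.items()}

# Illustrative data only: three tags observed across four fractions.
fraction_sizes = [1000, 2000, 3000, 4000]
fraction_tags = [{"tagA"}, {"tagA", "tagB"}, {"tagB", "tagC"}, {"tagA"}]
fingerprints = reconstruct_fingerprints(fraction_tags, fraction_sizes)
```

Because only the tagged fragment is visualized, each size recorded for a tag can be read as the distance from the tagged end of the clone to one site for the partial-digest enzyme, so the sorted list serves as that clone's fingerprint for the enzyme.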

In this way, a physical map can be constructed from a pooled sample-tagged
genomic library without the need to isolate individual clones. However, for many
uses, individual clones need to be isolated. If only a few clones are needed, these
clones can be isolated from the original pool using traditional colony hybridization
techniques (e.g., Ausubel et al., 1997). Since a unique sample tag is associated with the
clone, the probe would consist of a labeled oligonucleotide that is complementary to
the sample tag of interest, assuming the sequence of the sample tag is known. If the
sample tag sequence is not known, one could obtain some genomic sequence from
every clone using the same array and the sequencing method described above.

Every clone can be isolated by spatially separating individual clones in the
original pool. These clones are then repooled in a systematic way. For example, one
million clones can be grouped in three dimensions (100 x 100 x 100), yielding 300
subpools (100 + 100 + 100). The work to pick and pool one million clones is not trivial.
As described above, it may be cost effective to optimize the number of informative
sample tags, i.e., tags that are each associated with only one genomic fragment (of
course, it also is possible to construct a different library in each sample-tagged vector
and pool one clone from each library). Since the informative sample tags are present only
once in the original pool, each of these sample tags should be present in only three of the
subpools (representing the x, y and z dimensions of the 3-dimensional grouping). The
population of tags in any one subpool can be determined by amplifying the sample
tags in the subpool (e.g., by PCR) followed by hybridization to the array of tag
complements. Consequently, each sample tag is given a spatial address which
corresponds to the clone that contains the sample tag. This approach to pooling is
described in, e.g., Yoshida et al. (1993).
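
By way of illustration only, the 3-dimensional addressing described above can be sketched in Python. The numbering of clones and the function names are assumptions made for this sketch: clones are numbered from 0, each clone is placed in one subpool per axis, and a tag detected in exactly one subpool on each axis is decoded back to a clone number.

```python
def assign_subpools(index, side=100):
    """Return the (x, y, z) subpool coordinates of clone number `index`."""
    x, rem = divmod(index, side * side)
    y, z = divmod(rem, side)
    return x, y, z

def decode_address(x_hits, y_hits, z_hits, side=100):
    """Return the clone number for a tag seen in exactly one subpool per axis.

    Each argument is the set of subpool indices on that axis in which the tag
    was detected. Returns None when the tag is missing from an axis or is seen
    in several subpools of an axis, since the address is then ambiguous.
    """
    if not (len(x_hits) == len(y_hits) == len(z_hits) == 1):
        return None
    (x,), (y,), (z,) = x_hits, y_hits, z_hits
    return (x * side + y) * side + z

clone = 123456
x, y, z = assign_subpools(clone)
recovered = decode_address({x}, {y}, {z})
```

With `side=100`, one million clones require only 300 subpool hybridizations, and an informative tag yields exactly three positive subpools.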

Genomic clones can be spatially arrayed using flow cytometry. A reporter
gene (e.g., Green Fluorescent Protein as disclosed by Chalfie et al., 1996) can be
included in the cloning vector so that transformed (or transfected) cells can be
distinguished and separated from "empty" cells. Cells are given sufficient time for
phenotypic expression after transformation, and then they are subjected to "cell
sorting" (see Galbraith et al., 1999).

It is possible to combine pools after the partial digest and before size
separation. This technique is similar to the sequencing method above in which four
sequencing primers with different 5' sequences are used to identify the four
terminating nucleotides. In the above physical mapping method, the overhang
produced by the "rare cutter" is ligated to an adapter that differs in sequence for each
enzyme used to perform the partial digests. These sequences will serve as priming
sites for amplification of the tags after fractionation. The primers can be attached to
different labels (e.g., fluorophores) so that more information can be recovered per
hybridization. As with the sequencing methods described supra, the inclusion of
known fragments attached to known sample tags (i.e., sample-tagged size markers)
will uniquely identify each fraction and allow precise molecular weight determination
and calibration of signal intensities from one array to another.

6.4 Massively-parallel methods for locating 
insertion elements 

The methods described above exploit sample tags to determine either
sequence or physical map information from sample polynucleotides. Particularly with
respect to adapter tags, the relationship between a specific sample tag and the sample
polynucleotide is not important, i.e., a different sample tag joined to the sample
polynucleotide would still suffice to practice the inventions. Nevertheless, a
"byproduct" of the methods is the determination of which sample tag is joined to
which sample polynucleotide. For example, consider a collection of sample tags and a
collection of sequenced sample polynucleotide clones. The two collections are
randomly joined to produce sample-tagged clones as described above. The goal is to
determine the identity of the sample tag joined to any particular sample
polynucleotide. One need only sequence the sample-tagged polynucleotides as
described above to obtain the desired information. Of course, this example is meant to
be illustrative. A more practical use is to randomly join a collection of sample tags to
chromosomal DNA and determine which tag is coupled to which chromosomal
region. When the act of joining is performed in vivo, the sequencing and physical
mapping methods can be used to determine the locations of sample-tagged insertion
elements.

In a preferred embodiment, a collection of sample-tagged insertion elements is
prepared as shown in Figure 2, wherein the sample tag comprises a distinct sequence
element 106 flanked on both sides by common elements 104 and 108. The insertion
elements are easily constructed, for example by ligating a pool of sample tags into the
insertion element "backbone". This backbone may reside in a vector that is lost after
integration of the insertion element into the genome (e.g., a suicide vector). The
insertion element is capable of random integration (or near-random integration) into
the genome (for example, retroviral vectors for mammalian cells, Tn10 vectors for E.
coli, P element vectors for Drosophila, transfected DNA of any kind in mammalian
cells, etc.; see Kleckner et al., 1991; Dellaporta, 1999; Hamilton et al., 1994; Sands,
1998). The method may be practiced using any type of cell or cell line capable of
integrating foreign DNA into the genome, such as Saccharomyces cerevisiae,
Escherichia coli, Bacillus subtilis, mammalian cell lines, plant cell lines, Drosophila
embryos, zebra fish cell lines, etc. Several transposons have been shown to function in
distantly related organisms (Sherman et al., 1998; Rubin et al., 1999), suggesting this
mode of integration may be generalized to virtually any cell. In a preferred
embodiment, the method may be practiced with cells from which a multicellular
organism can be regenerated, such as embryonic stem cells (Stewart, 1993), fetal stem
cells (Campbell, 1996), plant cells (Azpiroz-Leehan, 1997), etc.

A collection of cell clones is generated by randomly inserting the sample-tagged
insertion elements into the genome so that any one cell (or organism)
preferably will have undergone only one integration event (note: the analysis is
identical for multiple integration events) and preferably about 36% or more of the
sample-tagged insertion elements have inserted at only one location in the genome
(about 36% is easily obtained by choosing about the same number of cell clones as
unique sample tags). These cells can be spatially separated. For example, mammalian
cells can be infected with a collection of sample-tagged retroviral vectors. Each
vector may contain a reporter gene (e.g., GFP). The infected cells (that is, the cells
that express the reporter gene) can be spatially separated from each other and from
uninfected cells by flow cytometry and cell sorting (see Galbraith et al., 1999), or by
other means. Though this example is directed towards random integration events, the
method is equally applicable to "targeted" integration events. For example, insertion
elements have been described that target the integration events to genes by providing
selectable markers that lack promoters, or must be properly spliced to function, etc.
(e.g., Sedivy et al., 1989; Friedrich et al., 1991; Skarnes et al., 1995; Ruley et al., 1997;
Sands et al., 1998).

The relationship between the sample tag and the cell clone that contains the
sample tag can be easily determined. The cell clones in the collection can be pooled
according to some standard scheme such as a 3-dimensional grouping (e.g., one
million clones can be ordered into 100+100+100=300 subpools where each tag is
present in 3 subpools; these subpools represent the x, y and z coordinates of the
3-dimensional group, see above). The sample tags present in any subpool are
determined, for example, by PCR amplifying genomic DNA from the subpool with
primers 114 and 116 in Figure 2 to generate tagged amplicons comprising the distinct
element 106 (or using any other amplification method known in the art), labeling the
amplicons and hybridizing to an array of tag complements wherein, for example, each
feature consists of oligonucleotides complementary to only one distinct sequence
element 106. A sample tag that is present only once in the collection (that is, it resides
in only one cell clone) will be present in only three subpools (in this example). The
three subpools will uniquely define the address of the cell clone that contains the
sample tag.

Working with a large collection of cell clones can be very laborious. One can 
increase the number of informative insertion elements among the collection of cell 
clones by choosing fewer cell clones than sample tags. For example, a collection of 
ten million sample-tagged insertion elements may be randomly integrated into cells as 
above, but only one million cell clones are isolated for subsequent analysis. About 
90% of the sample tags will be present in only one cell clone. The sample tags absent 
from the collection of cell clones are easily determined by amplifying the sample tags 
from a pool of the cells and hybridizing the amplicons to an array (or arrays) 
comprising all ten million tag complements. If necessary, new arrays can be 
synthesized with only the informative tag complements for any subsequent analysis. 
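
By way of illustration only, the proportions quoted above (about 36% and about 90%) can be checked with a Poisson approximation. The sketch assumes that each cell clone receives one sample tag chosen uniformly and independently at random; this assumption, and the function names, are not taken from the disclosure.

```python
import math

def fraction_tags_present_once(n_clones, n_tags):
    """Poisson estimate of the fraction of all tags found in exactly one clone."""
    lam = n_clones / n_tags  # mean number of clones per tag
    return lam * math.exp(-lam)

def fraction_clones_with_unique_tag(n_clones, n_tags):
    """Poisson estimate of the fraction of clones whose tag occurs in no other clone."""
    lam = n_clones / n_tags
    return math.exp(-lam)

# Equal numbers of clones and tags: about 1/e of the tags occur exactly once.
equal = fraction_tags_present_once(1_000_000, 1_000_000)
# Tenfold excess of tags over clones: about 90% of the clones carry a unique tag.
excess = fraction_clones_with_unique_tag(1_000_000, 10_000_000)
```

Under this reading, the figure of about 90% corresponds to the share of isolated cell clones that carry a tag found in no other clone of the collection.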
6.4.1 Locating insertion elements by 
sequencing 

The position in the genome of any sample-tagged insertion element can be
determined "en masse." DNA is prepared from the pooled collection of cell clones.
Inverse PCR (see Ochman et al., 1988; Silver et al., 1991) is performed on the DNA as
shown in Figure 2. The DNA is treated with a restriction enzyme that cuts at site 100.
The restriction products are circularized with DNA ligase and PCR amplified with the
primers 112 and 114, which hybridize to common elements 102 and 104 in the
insertion element. In this example, the amplicons comprise the sample tag and one
insertion element junction 110. The resulting pool of PCR products is simply a pool of
sample-tagged polynucleotide clones that can be sequenced using the massively-parallel
sequencing method described above, for example using primer 114 as a
sequencing primer and amplifying the fractionated sequencing products with primers
114 and 116.

The method of Inverse PCR described above involves cutting the genomic 
DNA with a restriction enzyme prior to circularization. Consequently, the amplicon 
derived from any particular cell clone will be a polynucleotide clone (i.e., all the
polynucleotides will be the same length and essentially identical). This polynucleotide 
clone is correctly termed sample-tagged. However, Inverse PCR could equally be 
performed with randomly sheared DNA. In this case, the amplicon will not be a 
polynucleotide clone, but will consist of polynucleotides of various sizes comprising 
the same sample tag. These tagged polynucleotides will all contain the insertion 
element junction, so it is more appropriate to refer to a sample-tagged junction than a 
sample-tagged amplicon. In both cases, the junction sequence generated by the 
parallel sequencing method will be the same. 

The sequence of the insertion element junctions can be used to locate the sites 
of integration within the genome. Very little sequence information is needed assuming 
the organism has been completely sequenced. Algorithms for comparing nucleotide 
sequences are well known in the art (see, e.g., Pearson, 1990; Altschul et al., 1990;
Suhai, 1997). 
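
By way of illustration only, the placement of a junction sequence within a sequenced genome can be sketched in Python. The sketch uses exact string matching on both strands, which is a deliberate simplification; the comparison algorithms cited above tolerate mismatches and gaps. The toy genome, the reads and the function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def locate_junction(genome, read):
    """Return (position, strand) for every exact match of `read` in `genome`.

    Positions are 0-based offsets on the forward strand of `genome`.
    """
    hits = []
    for strand, query in (("+", read), ("-", reverse_complement(read))):
        start = genome.find(query)
        while start != -1:
            hits.append((start, strand))
            start = genome.find(query, start + 1)
    return hits

genome = "ACGTTGCAAGGCTTACCGGATATCGGA"  # toy reference, not real data
hits_forward = locate_junction(genome, "GGCTTACC")
hits_reverse = locate_junction(genome, "GATATCCGG")
```

A junction read that returns exactly one hit is placed unambiguously. A read that returns several hits lies in repeated sequence and would need more sequence or physical map information to be positioned.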

It will be obvious to those skilled in the art that methods other than Inverse 
PCR can be utilized to amplify the sample-tagged junctions. For example, one could 
use Panhandle PCR (Dieffenbach et al., 1995), Vectorette PCR (Arnold et al., 1991),
etc., or even more traditional plasmid rescue protocols (see below), provided the
insertion element contains the proper functional elements (e.g., selectable marker and
origin of replication).

It is also obvious that other parallel sequencing methods can be used to
sequence the sample-tagged junctions. For example, Brenner describes methods for
attaching tagged polynucleotides to a solid support by hybridization to arrays or beads
comprising tag complements (see Brenner, 1997a; Brenner et al., 1998b). The
molecules can then be subjected to step-wise sequencing reactions (see, for example,
Brenner et al., 1998a; Brenner, 1998a; Albrecht et al., 1997; Cheeseman, 1994, etc.) in
which each "step" generates reaction products from the arrayed polynucleotides. The
reaction products are labeled according to a single base (or small number of bases). By
visualizing the reaction product at each address in the array, a single or small number
of bases can be determined. Repetition of the process produces more sequence
information from each tagged polynucleotide in the array. Drmanac et al. (1993)
describe a method for sequencing by hybridization with short oligonucleotide probes.
In this case, a hybridization reaction can be performed with the tagged polynucleotides
attached to the array. The reaction products are short labeled oligonucleotides
hybridized to those tagged polynucleotides containing complementary sequence. By
repeating the hybridization reaction with different oligonucleotides and noting the
addresses of the labeled reaction products, a sequence "profile" can be constructed for
the tagged polynucleotides. Usually this profile will consist of several sequence
contigs for each tagged polynucleotide (corresponding to the oligonucleotide
sequences). Enough contigs will provide the locations of the insertion elements (at
least to within several hundred base pairs or so).

6.4.2 Locating insertion elements by
restriction mapping

The location of insertion elements can also be determined by partial restriction
enzyme analysis. In this case, the sample-tagged junctions are isolated from the DNA
of pooled cell clones by a method that recovers genomic DNA fragments larger than
about 1 kb and more preferably larger than about 5 kb. In vitro amplification methods
such as those described above can be used with the proper modifications for
amplifying large fragments, such as Inverse PCR with "long-range PCR" conditions
(Ohler et al., 1992; Barnes, 1994). A preferred method of amplifying the junctions is
plasmid rescue in vivo (see, for example, Hamilton et al., 1994). Plasmid rescue entails
cutting the genomic DNA with a restriction enzyme (or randomly shearing the DNA),
circularizing the products and transforming the DNA into a host such as E. coli. The
insertion elements must be designed to carry a selectable marker and a plasmid origin
of replication (or some other element to ensure propagation in the host). Clearly, any
method capable of recovering large junction fragments is applicable. For example, the
insertion element may include a bacteriophage packaging signal (a pac site) for
efficient in vitro packaging of the genomic DNA (or the site, for example the lambda
cos sequence, can be ligated as an adapter to the genomic DNA) followed by
"infection" of the host. The insertion element may comprise, for example, a YAC
vector (Burke et al., 1987), and telomeres can be ligated to the genomic DNA followed
by transformation into S. cerevisiae. The genomic fragments comprising the
sample-tagged junctions can be enriched in vitro prior to amplification. Taidi-Laskowski
et al. (1988) and Rigas et al. (1986) describe methods for enriching for particular
polynucleotides in a library by recA-mediated DNA capture. Gossen et al. (1997)
describe a method for selecting DNA fragments that bind the lac repressor prior to
plasmid rescue (note: this method requires that the insertion elements contain the lac
operator). Indeed, the Gossen method is easily generalized to any molecule that
recognizes and binds to a particular DNA sequence element. Clearly, any appropriate
method of enrichment and/or amplification can be used, regardless of its complexity,
because it need only be performed once or a small number of times to rescue the
sample-tagged junctions from the entire collection of cell clones.

The sample-tagged junctions are analyzed by the method described above for
physical mapping. In a preferred embodiment, the landmarks are restriction sites. The
resulting restriction maps are compared to the restriction map of the genomic DNA
from the organism to determine overlaps between the junctions and the genomic
DNA. (Note: this analysis can be performed without knowledge of the complete
genomic sequence; only the sequence of the relevant restriction enzyme sites
throughout the genome is required. This information can be determined by the
physical mapping procedure described above.) It also is worth noting that this strategy
and the physical mapping method can be performed with much larger tags than is
practical with the sequencing strategy.

The tag complements used above for positioning insertion elements by
sequencing or physical mapping are preferably synthesized as short oligonucleotides
as described below, for example by the method of Fodor et al. (1995) or Montgomery
(1998). In this case, the sequence of the sample tags must be known. The arrays may
also be constructed by amplifying sample tags directly from the cell clones and
"spotting" the amplicons on a slide, or by synthesizing the arrays by in-situ amplification
of randomly distributed sample tags as described below in section 6.8. In the latter two
examples, the sequence of the sample tags need not be known. Note that if spotting is used
to construct the arrays, then each tag complement (derived from an amplified sample
tag) can have an address in the array that corresponds to the address of the cell clone
(see, for example, Hensel et al., 1995). As a result, the sequence or physical map
associated with each sample tag is already associated with a cell clone (by virtue of
the address of the tag complement), so the analysis with subpools described above is
not necessary. Of course, this latter spotting method is only informative when a cell
clone contains only one sample-tagged insertion element.

In some cases, both strategies outlined above (sequencing
and physical mapping) may be applied together to determine precisely the position of
insertion elements. For example, a genome with many repetitive elements may be refractory to
analysis by sequencing alone. Integration events that occur in repetitive elements will 
not always yield single copy sequence (that is, sequence that occurs only once in the 
haploid genome). However, restriction mapping can provide positional information 
that covers many thousands of base pairs. This information will usually place the 
insertion element at a single location in the genome. The sequence information then 
can be used to locate the exact position of the insertion element to single base 
resolution. 

By application of the methods described above, the location in the genome of
sample-tagged insertion elements can be determined as well as the spatial locations of
the cell clones (or organisms) that contain the sample tags. The insertion elements
that integrate within coding regions will often disrupt proper gene function. These
integration events are gene knockouts. By application of the methods to totipotent cell
lines (such as embryonic stem cells), multicellular organisms carrying the knockouts
can be constructed. If a particular gene is not "hit" by an insertion element, it is
possible to use insertion elements in surrounding regions to delete the gene. For
instance, FRT sites can be incorporated into the insertion element to facilitate
site-specific recombination (via FLP recombinase) between two insertion elements. If the
two insertion elements do not already exist in the same cell, they can be crossed
together by mating (assuming the cells or organisms are capable of mating). One
product of the recombination event is a deletion of the DNA between the two vectors
(see Golic, 1991 & 1994; Golic et al., 1996; Xu et al., 1993; Kilby et al., 1993). In
this way, other types of chromosomal deletions can be generated.

6.4.3 Locating insertion elements with
"genomic" tags

Alternative methods are available for determining the genomic position of the
insertion element. These alternatives do not require a sample tag to be present in the
insertion element. The sample tag is provided by the genomic DNA. These methods
can be particularly useful when it is prohibitively difficult to incorporate exogenous
DNA into the genome. For example, in Drosophila melanogaster, an appropriate
mating protocol can be used to generate many offspring that have undergone
independent germ-line transposition events of an endogenous P element (see Hamilton
et al., 1994). However, the introduction of an in-vitro modified P element into the
genome can be very time consuming and costly. Consequently, an alternative to the
use of sample-tagged P elements is beneficial.

In one embodiment, a collection of cells (or organisms) with insertion
elements is prepared. The collection is grouped into subpools according to a standard
scheme, for example 3-dimensional pooling as described above. From each subpool,
insertion element junctions are rescued by Inverse PCR or by another standard method
(as described above). The amplified junctions are labeled and hybridized to an array
of tag complements. In this case, the tag complements are prepared from the genomic
sequence. For example, short single-copy sequences that are randomly distributed
throughout the genome may be synthesized in arrays according to the methods of
Fodor et al. (1995) or Montgomery (1998), or the sequences may be derived from
ESTs (expressed sequence tags, see Adams et al., 1991) or the junction sequences
themselves (see below in section 6.4.3.1). Hybridization to the array reveals the spatial
address of the cell clone (or clones) that contains an insertion element in the genome
near the genomic tag (that is, which cell contains a junction comprising the genomic
tag). Depending on the number of polynucleotide clones in the collection and the
length of rescued DNA, multiple clones may hybridize to the same tag complement
(i.e., the insertion elements integrated near the same genomic tag). In this case, the
spatial address is ambiguous. It is possible to minimize these ambiguities by
analyzing the collection pooled according to a second scheme different from the first,
as described in Barillot et al. (1991). A preferred pooling scheme will provide
unambiguous spatial addresses for the clones without the need for further analysis of
subpools. Other pooling schemes employ two or more separate steps to determine
addresses (see, for example, Hamilton et al., 1991), where subsequent steps require
analysis of fewer subpools. While these other schemes require analysis of fewer
subpools to determine the address of a single cell clone, the work cannot be performed
efficiently in parallel. These stepwise pooling strategies are more appropriate for
positioning one or a small number of insertion elements at a time.

Clearly, the above analysis provides more information than simply the spatial
addresses of the "genomic-tagged" cell clones. For example, if the genomic sequence
is known, then the locations of the genomic tags are known. Consequently, the
approximate positions of the genomic-tagged junctions are known. If the absolute
positions of the genomic tags in the genome are not known (for example, the tag
complements could be derived from unmapped cDNA sequences), one still knows the
approximate "relative" locations of the insertion elements.



6.4.3.1 Refining the locations of
genomic-tagged insertion
elements by parallel sequencing

One method to obtain more precise positional information (absolute or
relative) is simply to sequence the junctions in parallel. A subset of the original
collection of cell clones is pooled and the insertion element junctions are amplified as
described above. This subset is chosen so that many of the genomic tags (preferably
more than about half) are present in only one cell clone. These amplicons are joined to
sample tags and sequenced according to the method of this invention (or any other
tag-based parallel sequencing method as described above). Preferably, the entire
genome has already been sequenced so any sequence from a junction will immediately
position it and reveal the associated genomic tag (regardless of whether or not the
genomic tag is actually sequenced along with the junction).

Preferably, the order of events is reversed; the junctions are sequenced first
and then the genomic tags are chosen, the array of tag complements is synthesized,
and the spatial addresses are determined. If desired, junctions from the entire
collection of cell clones can be prepared as one pool, joined to sequence tags and
sequenced. Different subpools of the sequence-tagged junctions can be sequenced
until nearly all of the junctions are analyzed. For example, consider a collection of
100,000 cell clones. The 100,000 junctions are cloned into a pool of about 300,000
different sequence-tagged vectors. About 300,000 clones are pooled and sequenced in
parallel with an array of 300,000 tag complements. Approximately 100,000 clones in
the pool are sequence-tagged and so yield sequence data (that is, about 36% (1/e) of
the sequence tags are associated with only one junction). Those 100,000 clones will
yield sequence information from about 64,000 different junctions. Repeating the
process on a different pool of 300,000 clones will yield the sequence from about 64%
of the remaining junctions (approaching 90,000 sequenced junctions). Tag
complements are designed from the sequence information and a new array is
synthesized. Now the collection of cell clones is repooled in a 3-dimensional array
with 47 subpools per dimension. The junctions are amplified from each subpool and
separately hybridized to the array to determine cell clone addresses. This order of
events permits one to locate the insertion elements with genomic tags without first
having any genomic sequence information. Of course, the locations are relative but
can later be placed in their absolute chromosomal locations, for example by completely
sequencing the genome.
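
By way of illustration only, the arithmetic of the preceding example can be reproduced in Python. The sketch assumes random, independent sampling (a Poisson approximation) for both the pairing of clones with sequence tags and the drawing of junctions; the function names are hypothetical. Its results agree with the rounded figures quoted above to within those approximations.

```python
import math

def informative_clones(n_clones, n_tags):
    """Expected clones whose sequence tag is carried by no other clone in the pool."""
    return n_clones * math.exp(-n_clones / n_tags)

def distinct_junctions(n_reads, n_junctions):
    """Expected number of different junctions hit by n_reads random draws."""
    return n_junctions * (1 - math.exp(-n_reads / n_junctions))

def subpools_per_dimension(n_clones, dims=3):
    """Smallest grid side s such that s**dims >= n_clones."""
    side = round(n_clones ** (1 / dims))
    while side ** dims < n_clones:
        side += 1
    while (side - 1) ** dims >= n_clones:
        side -= 1
    return side

tagged = informative_clones(300_000, 300_000)      # roughly 110,000 clones
first_pass = distinct_junctions(100_000, 100_000)  # roughly 63,000 junctions
second_pass = first_pass + distinct_junctions(100_000, 100_000) * (1 - first_pass / 100_000)
side = subpools_per_dimension(100_000)             # 46**3 < 100,000 <= 47**3
```

The value of `side` shows why 47 subpools per dimension are used: a 46 x 46 x 46 grid holds only 97,336 clones, whereas 47 x 47 x 47 holds 103,823.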

Restriction maps for genomic DNA flanking the insertion elements can be
quickly constructed by rescuing large fragments comprising the junctions in
sample-tagged vectors using an appropriate protocol such as plasmid rescue as described
above. The analysis of these sample-tagged junctions is identical to the methods
described in Section 6.3. The resulting restriction maps can be easily aligned with the
genomic tags described in the previous paragraph if the genomic sequence is known.
Alternatively, the sequence of the junctions can be obtained directly from these large
sequence-tagged clones (or smaller subclones comprising the junctions and sample
tags can be generated and sequenced), so the same sample tag is used for both
sequencing and physical mapping. In this way, any rearrangements in genomic DNA
flanking the insertion elements can be quickly ascertained.

6.4.3.2 Refining the locations of
genomic-tagged insertion
elements by physical mapping

An alternative method to more precisely locate the "genomic-tagged"
junctions is to employ a fractionation approach similar to the physical mapping
method described in section 6.3. Inverse PCR, plasmid rescue and other methods can
rely on cleavage of genomic DNA at well defined locations. Restriction enzymes are
usually employed, though other methods exist for cutting DNA at defined sites. See
Szybalski (1997) for a review of some other methods. The distance between the
cleavage site and the integration event determines the length of the rescued DNA
fragment. Consequently, if the distance between the genomic tag and the cleavage site
is known, then knowledge of the size of the rescued DNA will more precisely position
the site of the insertion element. Preferably, the genome is sequenced before
performing this analysis to simplify the choice of genomic tags near cleavage sites.

The size of the rescued DNA is readily determined. Genomic-tagged junctions
are prepared from the cell clones and separated by size (for example by gel
electrophoresis or chromatography). Fractions are collected and the junctions in each
fraction are amplified and labeled with the same (or nearby) primers that first were
used to amplify the DNA. The genomic-tagged amplicons in each fraction are
separately hybridized to an array of tag complements. The hybridization patterns can
be deconvoluted as described above for the physical mapping method to determine the
fragment size. Note that the genomic tags will be present in only a small subset of
fractions since only one fragment size per genomic tag is present in the collection. Of
course, inclusion of the appropriate size standards before fractionation will increase
accuracy.
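
By way of illustration only, the use of size standards to assign a fragment size to a fraction can be sketched in Python. The sketch interpolates linearly in log10(size) between the two flanking markers, a common approximation for gel mobility that is an assumption of this example and not part of the method as described. The marker positions and function name are hypothetical.

```python
import math

def estimate_size(fraction, markers):
    """Interpolate a fragment size (bp) for `fraction` from flanking size markers.

    markers -- dict mapping fraction index -> known marker size in bp
    Interpolation is linear in log10(size) between the nearest markers on
    either side of the fraction (an assumed calibration model).
    """
    if fraction in markers:
        return float(markers[fraction])
    lower = max((f for f in markers if f < fraction), default=None)
    upper = min((f for f in markers if f > fraction), default=None)
    if lower is None or upper is None:
        raise ValueError("fraction lies outside the range of the size markers")
    weight = (fraction - lower) / (upper - lower)
    log_size = ((1 - weight) * math.log10(markers[lower])
                + weight * math.log10(markers[upper]))
    return 10 ** log_size

markers = {0: 10_000, 10: 1_000}  # illustrative marker positions only
midpoint = estimate_size(5, markers)
```

A genomic tag detected in fraction 5 would, under this calibration, correspond to a rescued fragment of roughly 3.2 kb.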

In the embodiment described above, the entire length of DNA between the
cleavage site and the insertion element is hybridized to the array. This requirement
limits the range of detectable integration events from a particular cleavage site. If the
genomic tag is located within several hundred base pairs, more preferably within one
to twenty base pairs of the cleavage site, then it is possible to determine the location of
integration events many thousands of base pairs from the cleavage site. Similar to the
methods above, the collection of integration events is pooled in a standard fashion
such as a 3-dimensional scheme. Genomic DNA in the neighborhood of the insertion
element is rescued from the subpools by a method appropriate to larger DNA
fragments (e.g., plasmid rescue, see above section 6.3). The genomic DNA can be
rescued so that the cleavage site is juxtaposed to a known sequence, such as one end
of the insertion element or an adapter. Alternatively, DNA can be rescued from the
subpools, then cut at the cleavage site and joined to a known sequence. In either case,
a defined sequence is joined to the cleavage site; therefore a defined sequence is close
to the genomic tag. Now the genomic tag sequences can be amplified by any standard
method for amplifying DNA at vector-insert junctions. Representative approaches are
disclosed by Swensen (1996); Huang (1997); Ogilvie et al. (1996); Wu et al. (1996).
The amplified genomic tags are hybridized to an array of tag complements and the
spatial addresses are determined as above.
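
As an illustration of how spatial addresses might be recovered from a
3-dimensional pooling scheme, the following Python sketch assumes that each clone
occupies one plate, row and column position and that each subpool combines all
clones sharing one coordinate. The function name decode_addresses, the pool labels
and the tag names are hypothetical and are not part of the disclosed method.

```python
def decode_addresses(hits_by_pool):
    """Map each tag to the (plate, row, column) address implied by its pool hits.

    hits_by_pool maps a pool label such as ("plate", 3) to the set of tags
    that gave a hybridization signal when that subpool was analyzed.
    A tag that is not seen in exactly one pool per dimension maps to None.
    """
    dims = ("plate", "row", "column")
    seen = {}
    for (dim, index), tags in hits_by_pool.items():
        for tag in tags:
            seen.setdefault(tag, {d: [] for d in dims})[dim].append(index)

    addresses = {}
    for tag, by_dim in seen.items():
        if all(len(by_dim[d]) == 1 for d in dims):
            addresses[tag] = tuple(by_dim[d][0] for d in dims)
        else:
            addresses[tag] = None  # missing or ambiguous in some dimension
    return addresses


# Hypothetical hybridization results for five subpools.
hits = {
    ("plate", 0): {"tagA"},
    ("plate", 1): {"tagB", "tagC"},
    ("plate", 2): {"tagC"},
    ("row", 2): {"tagA", "tagB", "tagC"},
    ("column", 5): {"tagA", "tagC"},
    ("column", 7): {"tagB"},
}
addresses = decode_addresses(hits)
```

In this sketch a tag detected in two plate pools (tagC) is reported as unresolved
rather than being assigned an address.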

More precise positional information can be obtained by determining the length
of the DNA fragment that separates the cleavage site from the insertion element.
Large fragments comprising the junctions are rescued from the pooled collection of
cell clones using the same method that was previously applied to the subpools
(alternatively, the rescued DNA from all the subpools can be combined into one pool).
A defined sequence is joined to the cleavage site (near the genomic tag as described
above) and the rescued DNA is linearized, preferably at a restriction site very near the
junction, for example a site engineered into the insertion element. The end result is a
collection of linear molecules. Each molecule has a defined sequence near the
genomic tag at one end and some insertion element sequence at the opposite end. Of
course, other DNA fragments may be generated during this process, but they are not
joined to genomic tags. Now this collection of linear molecules can be fractionated by
size. The genomic tags in each fraction are amplified and labeled as above. Finally,
the genomic tags are hybridized to an array of tag complements and the size of each
linear molecule is deconvoluted from this data.
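
The size deconvolution can be sketched as follows. This Python example assumes
that each tag's fragment size is read from the fraction giving the strongest
hybridization signal, and that fraction number is approximately linear in the
logarithm of fragment size, calibrated with size standards. Both assumptions, and
the names fit_log_size and estimate_sizes, are illustrative only.

```python
import math


def fit_log_size(standards):
    """Least-squares line of log10(size) against fraction number.

    standards is a list of (fraction_number, size_in_base_pairs) pairs.
    """
    xs = [fraction for fraction, _ in standards]
    ys = [math.log10(size) for _, size in standards]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_sizes(signals, standards):
    """Estimate a size for each tag from its peak fraction."""
    slope, intercept = fit_log_size(standards)
    sizes = {}
    for tag, per_fraction in signals.items():
        # Index of the fraction with the strongest signal for this tag.
        peak = max(range(len(per_fraction)), key=per_fraction.__getitem__)
        sizes[tag] = 10 ** (slope * peak + intercept)
    return sizes


# Hypothetical standards and array signals for five fractions.
standards = [(0, 10000), (2, 1000), (4, 100)]
signals = {
    "tag1": [0.1, 0.2, 5.0, 0.3, 0.1],
    "tag2": [0.0, 4.0, 0.2, 0.1, 0.0],
}
sizes = estimate_sizes(signals, standards)
```

A finer estimate could weight neighboring fractions by signal intensity; the peak
fraction is used here only to keep the sketch short.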

Further resolution of the positions of vector integration may be achieved by
performing a partial restriction digest on the collection of linear molecules. The
analysis is identical to that described in section 6.3 with the exception that tagged
amplicons are generated using a vector-insert junction amplification protocol.
Restriction mapping has the advantage that the distance between the integration site
and other, closer cleavage sites will be known. In addition, comparison of the
restriction map to the genomic restriction map will uncover any DNA rearrangements
that may have occurred during any step of the procedure.
6.5 Sample Tags 
Sample tags used in the analysis of sample polynucleotides fall into two main
classes: adapter tags and genomic tags. In general, adapter tags are joined to the
sample polynucleotides to practice the invention. Genomic tags comprise sequence
elements that are contained in the sample polynucleotides in their natural state prior to
practicing the invention (e.g., the sample tag comprises cDNA or genomic DNA). Of
course, one skilled in the art will recognize that sample tags can be a combination of
adapter and genomic tags; indeed, many genomic tags will comprise additional
sequences joined to the sample polynucleotide to practice the invention (for example,
to facilitate amplification of the genomic tags). Genomic tags are particularly useful
to position certain insertion elements, as described above. Genomic tags may also
substitute for adapter tags in the sequencing and physical mapping embodiments.

6.5.1 Designing Tags 
A preferred form of adapter tag is shown in Figure 1a. A variable, or distinct,
sequence element 12 is flanked on both sides by common sequence elements 10 and
14. The distinct element is used to identify the sample polynucleotide. The common
elements, which are shared by many sample tagged polynucleotides, are used as
priming sites to amplify the sample tag. Methods for designing the distinct sequence
elements are well known in the art. For example, Brenner (1997b) teaches how to use
a simple algorithm to choose suitable tags.
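
One simple way to choose distinct sequence elements is sketched below in
Python. It is not the algorithm of Brenner (1997b); it is a generic greedy selection
that assumes tags are suitable when every pair differs at a minimum number of
positions, so that a tag is unlikely to cross-hybridize to the complement of another
tag. The names hamming and choose_tags and the parameter values are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def choose_tags(length, min_distance, limit):
    """Greedily collect sequences that differ pairwise at >= min_distance positions."""
    chosen = []
    for bases in product("ACGT", repeat=length):
        candidate = "".join(bases)
        if all(hamming(candidate, tag) >= min_distance for tag in chosen):
            chosen.append(candidate)
            if len(chosen) == limit:
                break
    return chosen


# Twenty 6-base elements, any two of which differ at three or more positions.
tags = choose_tags(length=6, min_distance=3, limit=20)
```

A practical design would likely add further filters, for example on base
composition or predicted melting temperature, which are omitted here.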

In vitro selections can be employed to create a pool of sample tags.
Montgomery (1998) and Fodor et al. (1995) teach methods for making arrays with
oligonucleotides of any sequence. Consider an array of 1000 or more oligonucleotides
wherein each oligonucleotide comprises a distinct element flanked on both sides by
common elements. In addition, the common elements contain the recognition
sequence for a restriction enzyme that cuts at or near the two junctions of the distinct
and common sequence elements. Now, the distinct elements can be PCR amplified
from the array by priming at the common elements (DNA polymerases are known to
function on arrays, see Bulyk et al., 1999). A label (e.g., fluorescein) can be
incorporated into the amplicon. The common elements are separated from the distinct
elements by cleaving the amplicons with the restriction enzyme. An affinity moiety
(e.g., biotin) can be included in the PCR primers to facilitate affinity separation of the
common elements (and uncleaved amplicons) from the distinct elements. These
distinct elements are hybridized to the array. Alternatively, the common elements do




not have to be removed from the amplicons if the hybridization is to a second array of
oligonucleotides comprising only the distinct sequence elements and not the common
elements. Only those sequences that produce strong hybridization signals in this assay
are chosen as sample tags. For example, a second array with only the chosen
sequences can be synthesized as above. The tags are amplified from the array and
joined directly to sample polynucleotides or the tags are cloned into vectors for
subsequent manipulations.

A variation of the above in vitro selection is possible. In this example, the
distinct sequence elements are randomly synthesized on a DNA synthesizer (e.g., ABI
Model 394). For instance, all possible 20 base oligonucleotides can be synthesized at
one time by programming the synthesizer to incorporate all four bases at each
position. The mixture of random oligonucleotides is cloned into a vector, and a
random subset of 1000 or more clones is chosen for further analysis. Now the distinct
sequence elements can be amplified by priming in the vector sequences on either side
of the distinct element. It is possible to select for optimal adapter tags by denaturing
the amplicons and selecting for rapidly renaturing distinct sequence elements. For
example, the renatured amplicons can be treated with a single-strand specific
endonuclease (e.g., Mung Bean Nuclease or S1 nuclease) to destroy mismatched
duplexes and single-strand DNA. Surviving DNA can be reamplified and cloned or
subjected to another round of selection. Other selections are possible. For example,
the amplicons can be designed as above where the PCR primers contain an affinity
moiety (e.g., biotin) and the common sequence elements contain restriction enzyme
recognition sites near the junctions of the common and distinct sequence elements.
The affinity moiety is used to bind the tags to a solid support. The restriction sites are
used to ligate the distinct sequence elements to a different vector or adapter, thereby
replacing the first set of common sequences with a second set. The distinct sequence
elements are PCR amplified with a second pair of primers specific to the second set of
common sequence elements. The resulting amplicon is hybridized to the first
amplicon bound to the solid support. Unhybridized strands are washed away and
hybridized strands are denatured and reamplified with the second pair of primers.
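
The scale of the random 20 base library mentioned above can be checked with a
short calculation. The Python sketch below assumes idealized synthesis in which
each of the four bases is incorporated with equal probability at every position,
which a real synthesizer only approximates. The function names are hypothetical.

```python
import random

BASES = "ACGT"


def library_size(length):
    """Number of distinct sequences of the given length over four bases."""
    return 4 ** length


def random_tags(count, length, seed=0):
    """Draw random distinct-element sequences, one per picked clone."""
    rng = random.Random(seed)
    return ["".join(rng.choice(BASES) for _ in range(length)) for _ in range(count)]


def expected_colliding_pairs(count, length):
    """Expected number of clone pairs that carry the same sequence by chance."""
    return count * (count - 1) / 2 / library_size(length)


total = library_size(20)                      # about 1.1e12 possible 20-mers
clones = random_tags(1000, 20)                # a random subset of 1000 clones
collisions = expected_colliding_pairs(1000, 20)
```

Under these assumptions the chance that two of 1000 picked clones share the same
distinct element is negligible, so essentially every clone carries a different tag.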

The random tag selections described above yield populations of sample tags
with distinct sequence elements of unknown sequence. These sample tags may be
used to make arrays by spotting (see Brown, 1998) or, as outlined below, to make
arrays wherein the tag is a "place holder". Therefore the sequence of the sample tags
need not be determined. However, to utilize arrays made according to the methods of
Fodor et al. (1995) or Montgomery (1998), the sequence of the tags must be
determined. Of course, each tag could be individually cloned and sequenced. A more
preferred method is simply to join the first set of tags of unknown sequence to a
second set of tags and sequence the first set in parallel according to the method of this
invention. That is, the first tags are the sample sequence elements with respect to the
second tags. By repeating the process, one set of tags can be used to construct a larger
set of tags.

In some embodiments, sample tags may be larger than the synthetic 
oligonucleotides described above. For example, restriction mapping may be 
performed on sample polynucleotides that are hundreds of kilobases in size. 
Consequently, large sample tags even greater than one kilobase can be tolerated. A 
simple way to construct a collection of sample-tagged vectors for cloning sample- 
tagged polynucleotides is to clone into a vector a random collection of fragments from 
genomic DNA (or normalized mRNA). Each random fragment (and in some cases 
flanking vector DNA) serves as a sample tag. Sample polynucleotides are cloned into 
the collection of sample-tagged vectors. Arrays may be constructed by separately 
PCR amplifying and spotting the random fragments in the vector. Also commercially 
available arrays may be used. For example, Affymetrix sells arrays of 
oligonucleotides that hybridize to yeast (S. cerevisiae) DNA and mRNA. In this case, 
one tag made from yeast sequences may hybridize to multiple different 
oligonucleotides in the array. 

6.5.2 Multiple sample tags per sample polynucleotide

In some embodiments, it is useful to join more than one sample tag to a 
polynucleotide. For example, sequence information or a restriction map may be 
obtained from both ends of a sample polynucleotide. Consider a "dual tag" vector that 
comprises two different tags on either side of a cloning site into which the sample 
polynucleotide is inserted. Of course, a collection of these vectors could be 
constructed one at a time, but this method is too time-consuming to construct large
sets of dual-tag vectors. Individual sample tags, selected as described above, can be
synthesized as pairs in an array, for example by the method of Fodor et al. (1995) or
Montgomery (1998), so that a cloning site separates each pair and common sequence
elements flank both sides of a pair. The pairs of sample tags can be amplified by the
common regions and cloned into a vector to form sample-tagged vectors. Sample
polynucleotides are cloned between the pairs of sample tags at the cloning site.

One can construct a collection of vectors with one tag as outlined above and
then randomly insert a second collection of tags into this collection of sample-tagged
vectors. However, information about the relationship between the two tags is useful
(i.e., which tags are in the same vector). In this way, information derived from
opposite ends of the same sample polynucleotide can be related. A simple way to
relate the two sets of tags is to use one set to sequence the other set according to the
method of this invention. Naturally, the random distribution of the second set of tags
means a one-to-one relationship between tags in the two sets will not always exist.
This problem is minimized by working with two large collections of tags and
choosing a smaller collection of dual-tag vectors. For example, the two collections of
tags may each contain 10 million distinct tags. After randomly joining the two
collections, a million dual-tag vector clones can be chosen randomly, in which case
more than 80% of the dual-tag vectors will comprise two tags that can uniquely
identify each other.
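
The figure of more than 80% can be reproduced with a short calculation. The
Python sketch below assumes that each chosen clone receives one tag from each
collection independently and uniformly at random, and that a pair of tags uniquely
identify each other when neither tag occurs in any other chosen clone. The function
name is hypothetical.

```python
import math


def fraction_uniquely_paired(tags_per_set, clones):
    """Expected fraction of clones whose two tags appear in no other clone."""
    # Probability that none of the other clones drew the same tag from one set.
    p_one_set = (1.0 - 1.0 / tags_per_set) ** (clones - 1)
    # Both tags must be unique, and the two sets are drawn independently.
    return p_one_set ** 2


estimate = fraction_uniquely_paired(tags_per_set=10_000_000, clones=1_000_000)

# Closed-form approximation: exp(-2 * clones / tags_per_set) = exp(-0.2).
approximation = math.exp(-2 * 1_000_000 / 10_000_000)
```

With 10 million tags per collection and one million clones, the expected fraction
is approximately exp(-0.2), or about 0.82, consistent with the figure stated above.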

Another method to obtain data from both ends of a sample polynucleotide
employs the vector design in Figure 3. Site 70 represents the recombination site for a
site-specific recombinase (for example, a lox site where Cre recombinase acts or a
FRT site where FLP recombinase acts), and the orientation of the site is represented
by the direction of the arrow. The common elements 60, 64 and 68 are present in all
the sample-tagged clones, whereas the distinct element 62 uniquely corresponds to the
sample sequence element 66. Sample tag A for analyzing one end of sample 66
comprises the following sequence elements: 60, 62, 70 and 64. Sample tag A can be
PCR amplified with primers 80 and 82. Sample tag B for analyzing the opposite end
of sample 66 comprises 60, 62, 70 and 68. Sample tag B can be PCR amplified with
primers 80 and 84. The tags are identified by hybridizing the amplicons to tag
complements comprising at least part of the distinct element 62. Notice sample tag B
does not exist until the sample-tagged clones are exposed to the site-specific
recombinase. After exposure to the recombinase, the population of sample-tagged
clones can contain approximately equal amounts of the sample tag A and B forms.
The invention is practiced on this mixture; for example, the clones can be sequenced
with primer 80. Only one set of tags (either sample tag A set or sample tag B set) is
amplified at a time. In this way, the same distinct sequence element 62 is used to
obtain data from both ends of the sample 66. The collection of sample-tagged vectors
can be constructed as outlined above for "single-tag" vectors.

Three or more sample tags can be used to analyze a sample polynucleotide.
For example, sample-tagged transposons can be used to randomly insert multiple
sample tags in vitro or in vivo (see Strathmann et al., 1991; Craig, 1996; Smith et al.,
1995).

6.6 Separating and fractionating tagged reaction products

Numerous methods exist for separating nucleic acids by size, for example,
chromatography (e.g., Bloch, 1999; Gjerde, 1999; Thayer et al., 1996; Hearn, 1991),
electrophoresis and "Time of Flight" separations based on charge to mass ratios (e.g.,
MALDI-TOF). Different methods resolve DNA fragments in different size ranges
and will be appropriate to different embodiments of the invention. It is clear to one
skilled in the art that any method of separation can be used which resolves fragments
in the appropriate size range and permits collecting the fragments in a form
compatible with subsequent amplification and/or hybridization to an array.

A preferred method of separation is gel electrophoresis. Agarose is a preferred
gel matrix for separating nucleic acid fragments that differ in size by tens to thousands
of bases. Polyacrylamide is a preferred gel matrix when single-base resolution is
required, such as in sequencing embodiments. Electrophoresis may be performed in, for
example, slab gels and capillaries.

Fractionation simply entails collecting the DNA fragments in one size range
away from the DNA fragments in other size ranges. For example, a gel containing
electrophoresed DNA fragments can be physically sliced into sections perpendicular
to the direction of electrophoresis, and the DNA fragments can be removed from each
slice by several means (e.g., β-agarase digestion, electroelution, etc.; see Ausubel et
al., 1997). Andersen (1998) describes an apparatus for electroeluting and collecting
separated molecules from a gel en masse, without slicing the gel. Alternatively, the
DNA fragments can be collected as they electrophorese through the end of the gel.
The fragments may be collected onto ionized paper or simply collected in separate
containers (see for example Beck, 1993; Richterich et al., 1993; Xu et al., 1997;
Wong, 1999; Mills, 1993; Karger, 1996; Kambara, 1996; Israel, 1976).

Chromatography (e.g., HPLC) can be performed with computer-controlled
instruments such that eluting DNA fragments are automatically collected in separate
containers for further analysis (Weston, 1997; Bloch, 1999).
6.7 Amplifying tags

A critical element of the invention is the ability to amplify tags before and/or 
after hybridization to the array. 

6.7.1 Amplification prior to hybridization 
Amplification of tagged reaction products can generate tagged amplicons with
much lower sequence complexity. Consequently, there is more material to perfectly
hybridize to the tag complements and there is much less material to cross-hybridize to
the tag complements. Both factors contribute to improving the signal to noise ratio
(i.e., sensitivity) of hybridization to the array of tag complements. More copies of
each tag will drive the hybridization kinetics, which allows more tags to be analyzed
in each hybridization reaction. The lower complexity of material not meant to
hybridize to the array will minimize the presence of false signals or background due to
cross-hybridization.

6.7.1.1 Adapter tags 
The adapter tags shown in Figure 1a are easily amplified by the preferred
method of the polymerase chain reaction with primer 18 and primer 20. Other
methods of in vitro amplification will also work, for example 3SR and related
methods (e.g., Gingeras et al., 1988; Kwoh et al., 1989; Gebinoga et al., 1996), Strand
Displacement Amplification (Walker et al., 1992 & 1993) and rolling circle
amplification (Lizardi et al., 1998; Zhang et al., 1998). Linear amplification methods
can be used. For example, one of the common regions may encode a promoter for an
RNA polymerase (e.g., T7, T3 and SP6), and in vitro transcription will amplify the
tag. A "one-sided" PCR reaction in which one primer is in excess over the other
primer ultimately will produce a linear amplification of the tag. One could even
amplify the tags with more traditional recombinant DNA methods involving cloning
the tags into a vector and passaging the clones through a host such as E. coli.

The in vitro methods of amplification can produce double-stranded amplicons.
To maximize hybridization to tag complements, one strand of the amplicon may be
removed prior to hybridization to the array. For example, an affinity moiety, such as
biotin, may be incorporated in one of the primers. The amplicon can be denatured and
the biotin-containing strand removed with streptavidin coated beads (e.g., Mitchell et
al., 1989). Enzymatic methods also may be used to remove one of the strands. For
example, lambda exonuclease preferentially degrades DNA with a 5'-phosphate group.
By incorporating a 5'-phosphate in only one of the primers, only one strand of the
amplicon will be degraded (see Ausubel et al., 1997; Takagi et al., 1993).

Other modifications to the amplicon may facilitate hybridization. The common
sequence elements 10 and 14 depicted in Figure 1a can be removed from the
amplicons prior to hybridization by incorporating restriction enzyme recognition
sequences in the common sequence elements. By choosing enzymes that cleave
outside their recognition sequences (e.g., BsrDI, BsmBI, etc.), it is possible to
completely separate the common sequence elements from the distinct element 12 by
cutting the amplicons with the enzymes.

6.7.1.2 Genomic Tags

Genomic tags are not as easily amplified as adapter tags because, in some
embodiments, common sequence elements cannot be so readily designed to flank both
sides of the distinct genomic sequence element in the sample-tagged polynucleotide.
Consider a genomic tag consisting of a common sequence element shared by other
sample-tagged polynucleotides and an adjacent sequence element from the sample
polynucleotide. Amplifying the genomic tag is analogous to amplifying the DNA at a
vector-insert junction. Representative approaches are disclosed by Riley et al. (1990),
Lagerstrom et al. (1991), Kere et al. (1992) and Liu et al. (1995). Inverse PCR (Silver
et al., 1991) is a simple method to provide a second common sequence element for
amplification of the genomic tag by PCR. Double-stranded tagged reaction products
are cut with a restriction enzyme and ligated under conditions that promote
circularization. Now the first common sequence flanks both sides of the distinct
sequence element provided by the sample polynucleotide. Of course, not all
embodiments of this invention yield double-stranded reaction products (for
example, some sequencing embodiments), so the reaction products first must be
converted to the duplex form. Synthesis of the complementary strand can be achieved
in vitro with small random primers and a DNA polymerase (e.g., T4 DNA
polymerase, the Klenow fragment, etc.). Alternatively, T4 RNA ligase can circularize
single stranded DNA and RNA.
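
The effect of circularization in inverse PCR can be illustrated with a small
string model. The Python sketch below treats the cut reaction product as a top-strand
string whose two ends are ligated together, and assumes that two primer sites lie
within the known common element and point outward toward the unknown sequence. The
function name and all sequences are hypothetical.

```python
def inverse_pcr_amplicon(fragment, forward, reverse_site):
    """Top-strand amplicon expected from a circularized linear fragment.

    forward      : primer sequence (top strand) reading toward the unknown DNA
    reverse_site : top-strand sequence where the second primer anneals; it lies
                   upstream of `forward` within the known common element
    Returns None if either site is absent or the sites are in the wrong order.
    """
    f = fragment.find(forward)
    r = fragment.find(reverse_site)
    if f < 0 or r < 0 or r + len(reverse_site) > f:
        return None
    # Read from the forward primer to the end of the fragment, cross the
    # ligation junction, and continue to the end of the reverse primer site.
    return fragment[f:] + fragment[: r + len(reverse_site)]


reverse_site = "ACGTACGA"          # first half of the common element
forward = "CTGACTGA"               # second half of the common element
genomic = "TTTTGGGGCCCC"           # adjacent sample sequence, cut at its far end
fragment = reverse_site + forward + genomic

amplicon = inverse_pcr_amplicon(fragment, forward, reverse_site)
```

In the resulting amplicon, sequence from the first common element lies on both
sides of the sample-derived sequence, which is the property that permits PCR.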

Another method to provide a second common sequence element entails
engineering a restriction enzyme site in the first common sequence element. Certain
restriction enzymes cleave well away from their recognition sequence (e.g., Bpm I,
Bsg I, Mme I, etc.). These enzymes can cut up to 20 base pairs into the sample
sequence elements. The second common sequence element can be ligated as an
adapter to the cleaved reaction products. Prior to amplification, the genomic tags can
be purified from other ligation products by, for example, denaturation, followed by
hybridization to a solid support-bound oligonucleotide that is complementary to the
first common sequence.
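
The following Python sketch illustrates how a recognition site engineered at the
edge of the first common sequence element defines a short genomic tag. It assumes
a top-strand model in which the enzyme cuts a fixed number of bases downstream of
its recognition site; the site string, the reach of 20 bases and the sequences are
illustrative values, and the actual cut positions of any particular enzyme should
be taken from the supplier's specification.

```python
def genomic_tag(sequence, site, reach):
    """Return the bases between a recognition site and a cut `reach` bases downstream.

    Returns None when the site is absent or the cut would fall beyond the
    end of the molecule.
    """
    start = sequence.find(site)
    if start < 0:
        return None
    begin = start + len(site)
    end = begin + reach
    if end > len(sequence):
        return None
    return sequence[begin:end]


# Hypothetical common element ending in the recognition site, followed by sample DNA.
common = "GGATCCTCCGAC"
sample = "ATGCATGCATGCATGCATGCTTTTTTTT"
tag = genomic_tag(common + sample, site="TCCGAC", reach=20)
```

Because the site sits at the very end of the common element in this example, all
20 released bases come from the sample polynucleotide.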

Vector-insert junctions are routinely amplified by providing the second
common sequence element during a random priming event. A first oligonucleotide of
known sequence (the primer oligo), sometimes coupled to random bases at the 3'-end,
is used to prime DNA synthesis after denaturation of double-stranded DNA. If the
primer oligo initiates DNA synthesis near the vector junction (i.e., near the first
common sequence element), then the junctions (i.e., genomic tags) can be amplified
by PCR with the primer oligo and an oligonucleotide complementary to the vector. A
variation of this strategy for amplifying short genomic tags entails tethering the primer
oligo to the first common sequence element. The local concentration of the primer
oligo becomes very high near the genomic tag, which means the random priming
event is likely to occur very close to the tag. The tethering event can be accomplished
by first tethering the primer oligo to a second oligonucleotide that is complementary
to the first common sequence element. The two oligonucleotides may contain a biotin
moiety and they are coupled by a streptavidin "bridge." Hybridization of the second
oligonucleotide to the first common sequence element serves to tether the primer oligo
to this region. Of course, other functionally equivalent methods to tether two
oligonucleotides can be used to practice this embodiment.

6.7.2 In situ amplification
Tagged reaction products may be amplified after hybridization to the array.
The amplicons must remain tightly associated with the hybridized products from
which they are derived. This association may be maintained by a physical coupling of
the reaction products and amplicons (e.g., rolling circle amplification) or diffusion of
the amplicons can be restricted.

Several methods have been described for in situ amplification. Lizardi (1998)
and Zhang et al. (1998) disclose methods employing rolling circle replication and
strand displacement. For example, consider a linear tagged reaction product with a
tag comprising a common sequence element at the 5'-end of the reaction product
flanked by a distinct sequence element. The tag complement, to which the reaction
product is hybridized, consists of an oligonucleotide complementary to the distinct
element. Another oligonucleotide, complementary to the common element with an
additional sequence element at the 3'-end, may be hybridized to the common element,
then ligated to the tag complement. The additional 3'-overhanging sequence element
can prime rolling circle replication from a closed circular DNA molecule provided in
solution. The amplicons are covalently coupled to the tag complement.

Obvious variations are possible; for example, the tagged reaction product may
possess the common element at the 3'-end, in which case rolling circle replication
may be primed directly from this sequence element. In addition, Lizardi et al. (1998)
describe the use of oligonucleotides with reversed backbones capable of hybridizing
to the common sequence element while providing an overhanging 3'-end that can
prime rolling circle replication. The reverse-backbone oligonucleotide may be
hybridized to the common element, then ligated as above to the tag complement,
allowing the rolling circle replication products to be covalently coupled to the tag
complement. The tagged reaction product itself may be circularized prior to
hybridization to the array. In this case, the tag complement may prime rolling circle
replication directly from this circular substrate.

Adams et al. (1997) describe a method for in situ amplification in which two
primers for PCR are attached to a solid support. Consider the tagged reaction product
described above, in which the tag consists of a common element at the 5'-end
followed by a distinct element. The array comprises oligonucleotide tag
complements, coupled to a solid support at their 5'-end in addressable locations, and a
common oligonucleotide identical in sequence to the common region, distributed
throughout the array. Assuming only the non-complementary strand of the common
sequence element is present in the hybridization reaction, the reaction product can
only hybridize at the tag complement. A subsequent polymerization reaction will
extend the tag complement into the common sequence element. The resulting
extension product can be amplified in situ by PCR according to the method of Adams
et al. (1997).

Chetverin et al. (1997) and Church (1999) describe methods for amplifying
nucleic acids in an immobilized medium to generate discrete "colonies" of amplicons.
This method can be applied to the present invention by adding the immobilization
media to the array, for example after hybridization of the tagged reaction products.
Church describes a method to attach the nucleic acids to the immobilization media
with a polymerization reaction that is primed by a complementary oligonucleotide
already attached to the media (see Khrapko et al. (1996) and Kenney et al. (1998) for
other methods of attaching oligonucleotides to agarose and polyacrylamide
membranes). Amplification is performed using 3SR (Gingeras, 1988) in which the
oligonucleotides encode the promoter for an RNA polymerase (e.g., T7).
Amplification occurs exponentially in a reaction that couples transcription, reverse
transcription and second-strand synthesis. In a preferred embodiment, the
oligonucleotide bound to the immobilization media does not encode the promoter. A
second oligonucleotide that encodes the promoter is free to diffuse throughout the
media. In this way, a "one-sided" 3SR reaction is performed on the immobilized
nucleic acid. The bound oligonucleotide hybridizes to the newly synthesized
transcripts, which limits diffusion and primes reverse transcription, thereby producing
exponential amplification.

6.8 The Array 

Preferably, detection of hybridization information takes place at spatially
discrete locations where tags hybridize to their complements. It is important that the
detection of signals from different fractions or pools be associated with tag
complement locations that can be identified throughout the procedure. Otherwise, the
sequence of signals will not be a faithful representation of the mobility and/or spatial
address of the polynucleotide fragments corresponding to the tag and tag complement.
This requirement is met by providing a spatially addressable array of tag
complements. For some embodiments, knowledge of the identity of a tag complement
is not crucial; it is only important that its location be identifiable from one
hybridization to another. Preferably, the regions containing tag complements are
discrete, i.e., non-overlapping with regions containing different tag complements, so
that signal detection is more convenient. Generally, spatially addressable arrays are
constructed by attaching or synthesizing tag complements on solid phase supports.
Solid phase supports for use with the invention may have a wide variety of forms,
including microparticles, beads, membranes, slides, plates, micromachined chips,
and the like. Likewise, solid phase supports of the invention may comprise a wide
variety of compositions, including glass, plastic, silicon, alkanethiolate-derivatized
gold, cellulose, low cross-linked and high cross-linked polystyrene, silica gel,
polyamide, and the like. Preferably, either a population of discrete particles is
employed such that each particle has a uniform coating, or population, of
complementary sequences of the same tag (and no other), or a single or a few supports
are employed with spatially discrete regions each containing a uniform coating, or
population, of complementary sequences to the same tag (and no other). In the latter
embodiment, the area of the regions may vary according to particular applications;
usually, the regions range in area from several µm², e.g., 3-5, to several hundred µm²,
e.g., 100-500.

Tag complements are preferably polynucleotides, and they may be used with
the solid phase support that they are synthesized on, or they may be separately
prepared and attached to a solid phase support for use, e.g., as disclosed by Lund et al.
(1988); Albretsen et al. (1990); Wolf et al. (1987); Ghosh et al. (1987); or Brown et al.
(1998). Preferably, tag complements are synthesized on and used with the same solid
phase support, which may comprise a variety of forms and include a variety of linking
moieties. Such supports may comprise microparticles or arrays, or matrices, of
regions where uniform populations of tag complements are synthesized. A wide
variety of solid supports may be used with the invention, including supports made of
controlled pore glass (CPG), highly cross-linked polystyrene, acrylic copolymers,
cellulose, nylon, dextran, latex, polyacrolein, and the like, disclosed in the following
exemplary references: Mosbach (1976); Rembaum et al. (1977); Rembaum (1983 &
1987); and Pon (1993). Solid supports further include commercially available
nucleoside-derivatized CPG and polystyrene beads (e.g., available from Applied
Biosystems, Foster City, Calif.); derivatized magnetic beads; polystyrene grafted with
polyethylene glycol (e.g., TentaGel®, Rapp Polymere, Tubingen Germany); and the
like. Selection of the support characteristics, such as material, porosity, size, shape,
and the like, and the type of linking moiety employed depends on the conditions under
which the tags are used. Exemplary linking moieties are disclosed in Pon et al.
(1988); Webb (1987); Barany et al. (1993); Damha et al. (1990); Beattie et al. (1993);
Maskos et al. (1992); and the like. When tag complements are attached or synthesized
on microparticles, populations of microparticles are fixed to a solid phase support to
form a spatially addressable array as disclosed in Brenner (1997a, 1998b).

As mentioned above, tag complements also may be synthesized on a single (or
a few) solid phase support(s) to form an array of features uniformly coated with tag
complements. That is, within each feature in such an array the same tag complement
is synthesized. Techniques for synthesizing such arrays are disclosed in Fodor et al.
(1995); Pease et al. (1994); Southern (1997); Maskos et al. (1992); Southern et al.
(1992); Maskos et al. (1993); Weiler et al. (1997); Montgomery (1998); and
Singh-Gasson et al. (1999).

The invention may be implemented with microparticles or beads uniformly
coated with complements of the same tag sequence. Microparticle supports and
methods of covalently or noncovalently linking oligonucleotides to their surfaces are
well known, as exemplified by the following references: Beaucage et al. (1992); Gait
(1984); and the references cited above. Generally, the size and shape of a
microparticle is not critical; however, microparticles in the size range of a few, e.g.,
1-2, to several hundred, e.g., 200-1000 µm diameter are preferable, as they facilitate the
construction and manipulation of large repertoires of oligonucleotide tags with
minimal reagent and sample usage.

Church (1999) discloses a method for preparing a randomly-patterned array of
polynucleotides, using in situ amplification methods. In a preferred embodiment, the
polynucleotides are amplified in situ using the "one-sided" 3SR in situ reaction
described above.

Arrays of fixed microparticles and arrays prepared by other means may be
replicated according to the methods of Cantor et al. (1998) or Church (1999). In this
way, even an array comprising randomly patterned tag complements of unknown
sequence may be effectively utilized in some embodiments of this invention. Of
course, a single array may be utilized many times, but there is always a limit to the
number of times it can be reused. The ability to replicate a randomly patterned array
relieves the experimental constraints of this limit, permitting, for example, hundreds
of bases to be sequenced (requiring hundreds of hybridizations to the array of tag
complements) according to the method of this invention.

Molecules other than polynucleotides may serve as tag complements. Gold et
al. (1993 & 1995) teach methods for selecting short polynucleotides that bind to
polypeptides and small molecules in a sequence-dependent manner. These short
polynucleotides can be utilized as tags and the molecules to which they bind may
serve as tag complements. Methods for constructing arrays of polypeptides and small
molecules are disclosed by, for example, Pirrung et al. (1995), Matson et al. (1995)
and Montgomery (1998). In addition, the spotting methods taught by Brown et al.
(1998) are readily adapted to other molecules. Methods in combinatorial chemistry
(see, for example, Wilson et al., 1997; Gordon et al., 1998; Kirk et al., 1998; Still et al.,
1996; Horlbeck, 1999) can be used to construct large collections of these molecules
such that only one molecular species is attached to any one, separate solid support
(e.g., a bead). These species may be arrayed as described above for polynucleotides.
Tags that hybridize optimally to these tag complements may be selected en masse as
described above for polynucleotide tag complements.

6.9 Detecting hybridization to the array

Methods for hybridizing polynucleotides to arrays of complementary
polynucleotides are well known in the art. See, for example, Lockhart et al., 1996;
Wang et al., 1988; Eisen et al., 1999; Duggan et al., 1999; Saiki et al., 1989.

Polynucleotides hybridized to the array may be visualized in several different
ways. To facilitate detection, various methods for labeling DNA and constructing
labeled oligonucleotides are known in the art. Representative methods include
Mathews et al. (1988), Haugland (1996), Keller et al. (1993), Eckstein (1991),
Jablonski et al. (1986), Agrawal et al. (1992), Menchen et al. (1993), Cruickshank
(1992), Urdea (1992) and Lee et al. (1999). Labels include, for example, radioactive
isotopes, fluorescent compounds such as fluorescein and rhodamine,
chemiluminescent compounds, quantum dots (e.g., see Bruchez et al., 1998; and Chan
et al., 1998) and mass tags (e.g., Xu et al., 1997; Schmidt, 1999). The polynucleotides
may be coupled to various enzymes (e.g., β-galactosidase, horseradish peroxidase and
alkaline phosphatase) and the enzymatic activity is detected with the proper substrate
(e.g., X-gal, DAB and BCDP, see Ausubel et al., 1997). The label can be incorporated
directly into polynucleotides, e.g., tagged amplicons, prior to hybridization to the array.
The label may also be incorporated during an extension reaction of polynucleotides
after hybridization to the array in which either the tag complements or the hybridized
polynucleotides act as primers for polymerase (see, for example, Pastinen et al., 1997).

Another method to visualize a tagged polynucleotide hybridized to its tag
complement is to hybridize a third labeled polynucleotide to the tagged
polynucleotide. This third polynucleotide may be ligated to the tag complement to
increase hybridization specificity in a reaction analogous to the "oligonucleotide
ligation reaction (OLA)", see Landegren et al. (1988). Alternatively, "oligonucleotide
stacking" effects of the third oligonucleotide can be used to increase duplex stability,
see, e.g., Lane et al. (1997).

Any imaging system can be utilized that is capable of detecting the label or
labels, with a resolution appropriate to the size of the array features. Numerous
examples of imaging apparatus are known in the art. For example, Trulson et al.
(1998), Pirrung et al. (1992), and Dorsel et al. (1999) describe imaging systems for
fluorescent labels. Commercial apparatus are available, e.g., ScanArray 4000 (General
Scanning), Biochip Imager (Hewlett Packard), GMS 418 Array Scanner (Genetic
Microsystems), GeneTAC 1000 (Genomic Solutions), Chip Reader (Virtek).
Phosphorimager systems are available for detecting radiolabels, e.g., Cyclone (Packard
Instrument Company) and BAS-5000 (Fujifilm).



A sequenced polynucleotide can be utilized in a variety of ways to manipulate
and discover information about biological systems, for example expression profiling,
drug discovery, gene therapy, disease diagnosis, disease treatment, characterization of
biological circuitry and so on. For some examples and methodologies see Hawkins et
al. (1996), Hastings et al. (1996), Guegler et al. (1997), Wachsman et al. (1997),
Popoff et al. (1997), Carraway et al. (1997), Li et al. (1997), Au-Young et al. (1998),
Hillman et al. (1998), Wei et al. (1998), Levinson et al. (1998), Gimeno et al. (1999),
Sutcliffe et al. (1999), Wei et al. (1999), Goodearl (1999), Kleyn et al. (1999), Lee et
al. (1999), and Oin (1999), which are hereby incorporated by reference in their
entirety.

One method according to the invention includes the insertion of a nucleic acid,
the sequence of which has been determined according to methods of the invention
described above, into a vector. The double-stranded form of the nucleic acid generally
is inserted into the vector by any of a variety of standard molecular cloning techniques
(see, e.g., Sambrook et al., 1989). The nucleic acid can be inserted into the vector in
either of the two possible orientations: transcription of this sequence yields either a
"sense" transcript (i.e., an mRNA sequence actually produced in cells expressing the
corresponding gene) or an "antisense" transcript (the complement of an mRNA
sequence actually produced). Conveniently, the vector is cleaved with a restriction
endonuclease, and the separated nucleic acid or fragment is ligated into the vector at
the corresponding restriction endonuclease recognition site. In one embodiment, the
nucleic acid sequence obtained according to the invention contains all coding
sequences that encode a protein or is inserted into the vector as part of a larger nucleic
acid sequence that contains all such coding sequences.
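
The effect of insert orientation on the transcript can be illustrated with a short sketch. The insert sequence and the function names below are hypothetical and chosen only for illustration; the sketch assumes a single promoter upstream of the cloning site that reads whichever strand is presented to it in the 5'-to-3' direction.

```python
# Illustrative sketch only: the sequence and helper names are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA strand written 5'->3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def transcript(insert_coding_strand: str, forward: bool) -> str:
    """RNA produced from a promoter located upstream of the insert.

    forward=True  : insert cloned in the orientation that yields sense RNA.
    forward=False : insert cloned in the opposite orientation, so the RNA
                    is the complement of the mRNA (antisense).
    """
    if forward:
        rna_like_strand = insert_coding_strand.upper()
    else:
        rna_like_strand = reverse_complement(insert_coding_strand)
    return rna_like_strand.replace("T", "U")


insert = "ATGGCCAAGTGA"  # hypothetical coding-strand fragment
sense = transcript(insert, forward=True)
antisense = transcript(insert, forward=False)
print(sense)      # AUGGCCAAGUGA
print(antisense)  # UCACUUGGCCAU
```

The two transcripts are reverse complements of one another, which is why the same cloned fragment can serve either for expression or for antisense applications depending only on its orientation in the vector.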

A wide variety of suitable vectors is available, including vectors derived from
bacterial and yeast plasmids as well as from viruses, e.g., cosmids, plasmids, phage
derivatives, and phagemids. Examples of bacterially derived vectors include: pBS,
phagescript, PsiX174, pBluescript SK, pBs KS, pNH8a, pNH16a, pNH18a, and
pNH46a, which are commercially available from Stratagene, and pTrc99A, pKK223-3,
pKK233-3, pDR540, and pRIT5, which are commercially available from
Pharmacia. Examples of eukaryotic vectors include: pWLneo, pSV2cat, pOG44, and
pXT1, which are commercially available from Stratagene, and pSVK3, pBPV, pMSG,
and pSVL, which are commercially available from Pharmacia. However, any vector
capable of replicating in a host cell can be employed. A vector generally has a
selectable marker to ensure that the vector will be maintained in host cells. Suitable
markers include, for example, those conferring resistance to tetracycline or ampicillin
(useful in prokaryotic cells) and neomycin (useful in eukaryotic cells).

The vector can be used simply to propagate the nucleic acid or can be specially
adapted for particular functions. Examples of the latter include probe generation
vectors and expression vectors. An expression vector allows the expression of an
amino acid sequence encoded in a nucleic acid or fragment. Typically the latter is
operatively linked to an expression control sequence (e.g., a promoter). The term
"operatively linked" is used herein to denote a relationship in which the expression
control sequence directs the synthesis of mRNA encoding the amino acid sequence to
be expressed. This term does not imply that the expression control sequence is
necessarily linked directly to the nucleic acid or fragment. Any promoter known or
determined to direct transcription of prokaryotic, eukaryotic, or viral genes can be
employed. Exemplary promoters include the E. coli lac or trp promoters, the early and
late SV40 promoters, the CMV immediate early promoter, the HSV thymidine kinase
promoter, and the lambda phage PR and PL promoters.

Expression vectors also may contain an enhancer sequence, i.e., a "cis-acting"
DNA element that acts on a promoter to increase transcription. Exemplary enhancers
include those derived from SV40, CMV, polyoma, and adenovirus. Generally,
enhancers are located upstream of and within about 100-300 bp of the promoter.
Expression vectors can also contain splice donor and acceptor sites, polyadenylation
sites, and translation initiation and termination sequences in appropriate phase with
the coding sequence to be expressed. A signal sequence is conveniently included if it
is not already present in the coding sequence and secretion of the encoded polypeptide
(into the culture medium or periplasmic space) is desired. In one embodiment, an
expression vector contains a nucleotide sequence that promotes amplification of the
vector in a host cell under appropriate culture conditions (e.g., culturing in the
presence of methotrexate for vectors including the dihydrofolate reductase gene).

In one method, a vector of the invention is introduced into a host cell. The
host cell can, for example, be a prokaryote, a lower eukaryote (e.g., a fungal cell), or a
higher eukaryote (e.g., a mammalian cell). Exemplary prokaryotic host cells include
E. coli, Bacillus subtilis, Salmonella typhimurium, and various species within the
genera Pseudomonas, Streptomyces, and Staphylococcus, although a wide variety of
others can be employed. Exemplary eukaryotic cells include yeast cells and higher
eukaryotic cells such as CHO, COS, or Bowes melanoma cells. The host cell
employed varies depending on the vector, and the selection of a suitable host
cell-vector system is within the level of skill in the art. When the vector is an expression
vector, the host cell is typically a mammalian cell, an insect cell, a plant cell, a fungal
cell (e.g., a yeast), or a bacterial cell.

A variety of host-expression vector systems may be utilized to express the
gene coding sequences of the invention. Such host-expression systems represent
vehicles by which the coding sequences of interest may be produced and subsequently
purified, but also represent cells which may, when transformed or transfected with the
appropriate nucleotide coding sequences, exhibit the gene product of the invention in
situ. These include but are not limited to microorganisms such as bacteria (e.g., E.
coli, B. subtilis) transformed with recombinant bacteriophage DNA, plasmid DNA or
cosmid DNA expression vectors containing the gene product coding sequences; yeast
(e.g., Saccharomyces, Pichia) transformed with recombinant yeast expression vectors
containing the gene product coding sequences; insect cell systems infected with
recombinant virus expression vectors (e.g., baculovirus) containing the gene product
coding sequences; plant cell systems infected with recombinant virus expression
vectors (e.g., cauliflower mosaic virus, CaMV; tobacco mosaic virus, TMV) or
transformed with recombinant plasmid expression vectors (e.g., Ti plasmid)
containing the gene product coding sequences; or mammalian cell systems (e.g., COS,
CHO, BHK, 293, 3T3) harboring recombinant expression constructs containing
promoters derived from the genome of mammalian cells (e.g., metallothionein
promoter) or from mammalian viruses (e.g., the adenovirus late promoter; the vaccinia
virus 7.5K promoter).

In bacterial systems, a number of expression vectors may be advantageously
selected depending upon the use intended for the gene product being expressed. For
example, when a large quantity of such a protein is to be produced, for the generation
of pharmaceutical compositions of the protein or for raising antibodies to the protein,
vectors which direct the expression of high levels of fusion protein products that are
readily purified may be desirable. Such vectors include, but are not limited to, the E.
coli expression vector pUR278 (Ruther et al., 1983), in which the gene product coding
sequence may be ligated individually into the vector in frame with the lac Z coding
region so that a fusion protein is produced; pIN vectors (Inouye et al., 1985; Van
Heeke et al., 1989); and the like. pGEX vectors may also be used to express foreign
polypeptides as fusion proteins with glutathione S-transferase (GST). In general, such
fusion proteins are soluble and can easily be purified from lysed cells by adsorption
and binding to a matrix of glutathione-agarose beads followed by elution in the presence
of free glutathione. The pGEX vectors are designed to include thrombin or factor Xa
protease cleavage sites so that the cloned target gene product can be released from the
GST moiety.

In an insect system, Autographa californica nuclear polyhedrosis virus
(AcNPV) is used as a vector to express foreign genes. The virus grows in Spodoptera
frugiperda cells. The gene coding sequence may be cloned individually into
non-essential regions (for example the polyhedrin gene) of the virus and placed under
control of an AcNPV promoter (for example the polyhedrin promoter). Successful
insertion of the gene coding sequence will result in inactivation of the polyhedrin gene
and production of non-occluded recombinant virus (i.e., virus lacking the
proteinaceous coat coded for by the polyhedrin gene). These recombinant viruses are
then used to infect Spodoptera frugiperda cells in which the inserted gene is
expressed (e.g., see Smith et al., 1983; Smith et al., U.S. Pat. No. 4,745,051).

In mammalian host cells, a number of viral-based expression systems may be
utilized. In cases where an adenovirus is used as an expression vector, the gene coding
sequence of interest may be ligated to an adenovirus transcription/translation control
complex, e.g., the late promoter and tripartite leader sequence. This chimeric gene
may then be inserted in the adenovirus genome by in vitro or in vivo recombination.
Insertion in a non-essential region of the viral genome (e.g., region E1 or E3) will
result in a recombinant virus that is viable and capable of expressing the gene product
in infected hosts (e.g., see Logan et al., 1984). Specific initiation signals may also be
required for efficient translation of the inserted gene product coding sequences. These
signals include the ATG initiation codon and adjacent sequences. In cases where an
entire gene, including its own initiation codon and adjacent sequences, is inserted into
the appropriate expression vector, no additional translational control signals may be
needed. However, in cases where only a portion of the gene coding sequence is
inserted, exogenous translational control signals, including, perhaps, the ATG
initiation codon, must be provided. Furthermore, the initiation codon must be in phase
with the reading frame of the desired coding sequence to ensure translation of the
entire insert. These exogenous translational control signals and initiation codons can
be of a variety of origins, both natural and synthetic. The efficiency of expression may
be enhanced by the inclusion of appropriate transcription enhancer elements,
transcription terminators, etc. (see Bitter et al., 1987).
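
The requirement that an exogenous initiation codon be in phase with the insert can be checked computationally before a construct is built. The following is a minimal sketch under stated assumptions: the leader and insert sequences are hypothetical, the helper names are not part of any library, and translation is assumed to start at the first ATG in the construct, which is a simplification of real initiation behavior.

```python
# Illustrative sketch only: sequences and helper names are hypothetical.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def start_in_phase(construct: str, insert_start: int) -> bool:
    """True if the first ATG lies upstream of the insert and is in the same
    reading frame as the insert's first codon (index insert_start)."""
    atg = construct.upper().find("ATG")
    if atg < 0 or atg > insert_start:
        return False
    return (insert_start - atg) % 3 == 0


def reads_through_insert(construct: str, insert_start: int, insert_len: int) -> bool:
    """True if translation from the first ATG is in phase with the insert and
    meets no stop codon before the end of the insert."""
    if not start_in_phase(construct, insert_start):
        return False
    seq = construct.upper()
    atg = seq.find("ATG")
    end = insert_start + insert_len
    return all(seq[i:i + 3] not in STOP_CODONS for i in range(atg, end - 2, 3))


leader = "GCCACCATGGGA"   # hypothetical vector leader supplying the ATG
insert = "AAACCCGGG"      # hypothetical partial coding sequence

in_frame = leader + insert
out_of_frame = leader + "C" + insert  # one extra base shifts the frame

print(start_in_phase(in_frame, len(leader)))            # True
print(start_in_phase(out_of_frame, len(leader) + 1))    # False
print(reads_through_insert(in_frame, len(leader), len(insert)))  # True
```

A single-base difference at the junction is enough to place the insert out of phase, which is why junction sequences are usually verified when only a portion of a coding sequence is cloned.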

In addition, a host cell strain may be chosen which modulates the expression
of the inserted sequences, or modifies and processes the gene product in the specific
fashion desired. Such modifications (e.g., glycosylation) and processing (e.g.,
cleavage) of protein products may be important for the function of the protein.
Different host cells have characteristic and specific mechanisms for the
post-translational processing and modification of proteins and gene products. Appropriate
cell lines or host systems can be chosen to ensure the correct modification and
processing of the foreign protein expressed. To this end, eukaryotic host cells which
possess the cellular machinery for proper processing of the primary transcript,
glycosylation, and phosphorylation of the gene product may be used. Such
mammalian host cells include but are not limited to CHO, VERO, BHK, HeLa, COS,
MDCK, 293, 3T3, WI38, and in particular, T cell lines such as, for example, Jurkat,
CTLL, HT2, Dorris, D1.1, AE7, D10.G4 and CDC25.

The vector can be introduced into the host cell by any effective technique, such
as transformation, transfection, infection, or transduction. Convenient transfection
techniques include calcium phosphate transfection, DEAE-dextran-mediated
transfection, and electroporation. The host cell containing the vector then can be
cultured in a conventional nutrient medium, modified as appropriate for selecting
vector-containing cells, inducing or derepressing a promoter, and/or amplifying a
vector DNA sequence. Otherwise, the culture conditions employed, such as pH and
temperature, are those suitable for the particular host cell. Suitable culture conditions
are known to, or can be readily determined by, those skilled in the art. If desired,
vector DNA can be prepared from a host cell culture using any of a number of
standard techniques.

A host cell containing an expression vector can be cultured under conditions
that allow expression of the encoded polypeptide. Typically, host cells are allowed to
grow to an appropriate density, and then a promoter linked to the nucleic acid to be
expressed is induced or derepressed (e.g., by temperature shift or chemical induction)
and/or a linked enhancer is activated. The host cells are cultured for an additional
period and then harvested, typically by centrifugation. If the expressed polypeptide
was secreted into the culture medium, the polypeptide is recovered from the culture
medium. Alternatively, if the expressed polypeptide was retained in the host cells, the
cells are disrupted by physical or chemical means, and the polypeptide is recovered
from the resulting crude extract.

The polypeptide can be purified from the culture medium or crude extract
using standard protein purification techniques. Suitable methods include ammonium
sulfate or ethanol precipitation, acid extraction, anion or cation exchange
chromatography, phosphocellulose chromatography, hydrophobic interaction
chromatography, affinity chromatography, hydroxylapatite chromatography, and lectin
chromatography, etc., and combinations thereof. The purification strategy also may
include a protein refolding step to provide a polypeptide having the proper structure.
High performance liquid chromatography can be employed, typically as one of the
final purification steps. Depending on the method of production, the polypeptide can
have methionine as the initial amino acid residue and can be glycosylated or
non-glycosylated.

In an alternative embodiment, a polypeptide is expressed by translating an 
mRNA corresponding to a nucleic acid whose sequence has been determined 
according to the methods of the invention in a cell-free translation system. 

The amino acid sequence of a polypeptide encoded by a nucleic acid whose
sequence has been determined according to the methods of the invention can be
compared to that of previously characterized polypeptides to identify one or more
biological functions of the polypeptide. A comparison of the nucleotide sequence of a
selected nucleic acid with the nucleotide sequence of previously characterized genes
can also indicate a biological function. Biological functions of particular interest
include the ability to bind to a ligand or a receptor, the ability to form an ion channel,
the ability to couple with a GTP-binding protein, the ability to phosphorylate or be
phosphorylated by another polypeptide, and the ability to otherwise modulate the
activity of another molecule that plays a role in a signal transduction pathway.

The polypeptide or a fragment thereof can be employed in a screening assay to
identify compounds and/or molecules that stimulate (agonists) or inhibit (antagonists)
the biological function of the polypeptide. In addition, if the identified biological
function includes a binding activity, the polypeptide can be employed in an assay to
detect the presence of the binding partner in cells or tissues.

Moreover, the polypeptide can be used to generate antibodies that stimulate or
inhibit the activity of the polypeptide or that bind the polypeptide without affecting
activity. As used herein, the term "antibody" refers to a molecule including any
binding-competent portion of an antibody, such as, for example, a single chain
antibody or a Fab fragment. The term encompasses molecules in which such
binding-competent portions are covalently attached to other polypeptide sequences, as in
dual-specificity antibodies. An antibody specific for the polypeptide can be polyclonal or
monoclonal. Polyclonal antibodies are produced by immunizing an animal, preferably
a mammal, with the polypeptide or an immunogenic fragment thereof and collecting
the antiserum. The antiserum can be screened for the desired binding activity, and
antibodies with undesirable cross-reactivities can be removed by contacting the
antiserum with the corresponding agent(s) and recovering the non-bound component
of the antiserum. Monoclonal antibodies can be produced by any convenient
technique, including the hybridoma technique (Kohler et al., 1975), the trioma
technique, the human B-cell hybridoma technique (Kozbor et al., 1983), and the
EBV-hybridoma technique, which produces human monoclonal antibodies (Cole et
al., 1985). Humanized antibodies can be produced using a transgenic animal, and
single chain antibodies can be produced as described by Ladner et al. (1990).
Antibodies that specifically bind a polypeptide encoded by a nucleic acid whose
sequence has been determined according to the methods of the invention are useful in
affinity purification of the polypeptide and for detecting the presence of and/or
quantitating the amount of a polypeptide in a sample. For instance, antibodies can be
employed in immunohistochemistry studies to determine the localization of the
polypeptide in cells of a tissue sample. In such studies, the polypeptide-specific
antibody is labeled with a detectable label, such as, for example, an enzyme label.
The label can be attached to the polypeptide-specific antibody directly or indirectly
(e.g., attachment to a secondary antibody specific for the polypeptide-specific
antibody). Antibody binding generally is detected by adding a substrate for the
enzyme and detecting conversion of the substrate to a product, usually via a color
change.

In yet another embodiment of the invention, sequences determined according
to the method of the invention may be used to design probes useful for detecting the
presence of a nucleic acid sequence complementary to the probe sequence. The
detection of such sequences can provide the basis of diagnostic tests, or alternatively,
may be useful for basic research purposes. Such probe sequences may be synthesized
using a commercially available nucleic acid synthesizer, such as the ABI Model 394,
or may be generated from restriction fragments of the cloned library element from
which the sequence was determined, and sub-cloning the restriction fragment into a
probe vector such as, e.g., pBluescript SK (Stratagene), pSP72 (Promega), M13mp18
(New England Biolabs) and the like. The cloned library element from which the
sequence was determined may be recovered by a variety of methods, including
probing the library with the tag sequence corresponding to the desired sequence, or
immobilizing the tag sequence (or its complement) on a solid support, and hybridizing
the library with the solid support to specifically recover the desired library element.
Such methods are well known in the art, see, e.g., Ausubel (1997) and Brenner (1997a).
In addition, multiple probes may be arrayed (e.g., Brown, 1998) or may be synthesized
as oligonucleotides in an array (e.g., Fodor et al., 1995) as described above in section
6.8.

Sequences determined according to the method of the invention also may be
used to design primers for amplification of nucleic acid molecules via methods such
as, e.g., PCR. Designing PCR primers from known sequences is well within the art.
Relevant considerations are discussed in, e.g., Dieffenbach et al. (1995) and Innis et
al. (1990). PCR, using primers designed from sequences determined according to the
method of the invention, may be used as the basis of a diagnostic test to determine the
presence of a nucleic acid sequence in a sample, or, alternatively, simply to provide
large quantities of nucleic acid for other uses.

6.10.1 Polynucleotide homologs

Another method according to the invention involves identifying
polynucleotides that are homologous at the nucleotide or encoded amino acid level to
a parent polynucleotide or parent gene sequenced by the methods described above.
Homologs may be isolated from the same species as the parent polynucleotide or from
a different species. Homologs may not occur naturally, but instead they may be
constructed from the parent polynucleotide by random or site-directed mutagenesis as
described below. By definition the parent polynucleotide is a homolog of itself.

A highly homologous polynucleotide preferably exhibits at least about 80%
overall similarity at the nucleotide level to the parent polynucleotide, more preferably
exhibits at least about 85-90% overall similarity, and most preferably exhibits at least
about 95% overall similarity to the parent polynucleotide. However, because of the
degeneracy of the genetic code, two polynucleotides that encode highly homologous
polypeptides may not necessarily exhibit extensive similarity at the nucleotide level.
In particular, site-directed mutagenesis can be used to produce two polynucleotides
that encode the same polypeptide, but share less than 67% similarity at the nucleotide
level.
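
The effect of codon degeneracy on nucleotide-level similarity can be shown with a small worked example. The sketch below is illustrative only: the two nine-base sequences are hypothetical, similarity is taken to mean simple position-by-position identity over an ungapped alignment (an assumption, since the text does not define the measure), and only the six codons needed for the example are included in the table.

```python
# Illustrative sketch only: sequences are hypothetical and the codon table
# is deliberately partial (standard genetic code entries for Ser, Leu, Arg).
PARTIAL_CODON_TABLE = {
    "TCT": "S", "AGC": "S",   # serine
    "CTG": "L", "TTA": "L",   # leucine
    "CGT": "R", "AGA": "R",   # arginine
}


def translate(dna: str) -> str:
    """Translate a sequence whose length is a multiple of three."""
    return "".join(PARTIAL_CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions in two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


seq_a = "TCTCTGCGT"  # Ser-Leu-Arg
seq_b = "AGCTTAAGA"  # Ser-Leu-Arg, using synonymous codons

print(translate(seq_a), translate(seq_b))         # SLR SLR
print(round(percent_identity(seq_a, seq_b), 1))   # 22.2
```

Serine, leucine and arginine each have six codons, so peptides rich in these residues allow the greatest nucleotide divergence without changing the encoded polypeptide.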

Homologous polynucleotides exhibiting extensive homology to one or more
domains of the parent polynucleotide can be identified and readily isolated, without
undue experimentation, by molecular biological techniques well known in the art.
Further, there can exist homologous genes at other genetic loci within the genome that
encode proteins which have extensive homology to one or more domains encoded by
a parent gene. These genes can also be identified via similar techniques. Still further,
there can exist alternatively spliced variants of the parent gene.

As an example, in order to clone a human gene homolog or variants using a
sequenced murine polynucleotide, the murine polynucleotide or sequence element is
labeled and used to screen a cDNA library constructed from mRNA obtained from
appropriate cells or tissues derived from the organism (in this case, human) of interest.
The hybridization and wash conditions used should be of a low stringency when the
cDNA library is derived from a different type of organism than the one from which
the labeled sequence was derived. Low stringency conditions are well known to those
of skill in the art, and will vary predictably depending on the specific organisms from
which the library and the labeled sequences are derived. For guidance regarding such
conditions see, for example, Sambrook et al. (1989) and Ausubel et al. (1989).

With respect to the cloning of a human homolog, using a murine
polynucleotide, for example, various stringency conditions which promote DNA
hybridization can be used. For example, hybridization in 6X SSC at about 45°C,
followed by washing in 2X SSC at 50°C may be used. Alternatively, the salt
concentration in the wash step can range from low stringency of about 5X SSC at
50°C, to moderate stringency of about 2X SSC at 50°C, to high stringency of about
0.2X SSC at 50°C. In addition, the temperature of the wash step can be increased from
low stringency conditions at room temperature, to moderately stringent conditions at
about 42°C, to high stringency conditions at about 65°C. Other conditions include, but
are not limited to, hybridizing at 68°C in 0.5 M NaHPO4 (pH 7.2)/1 mM EDTA/7%
SDS, or hybridization in 50% formamide/0.25 M NaHPO4 (pH 7.2)/0.25 M NaCl/1
mM EDTA/7% SDS; followed by washing in 40 mM NaHPO4 (pH 7.2)/1 mM
EDTA/5% SDS at 50°C or in 40 mM NaHPO4 (pH 7.2)/1 mM EDTA/1% SDS at
50°C. Both temperature and salt may be varied, or alternatively, one or the other
variable may remain constant while the other is changed.
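
The salt series given above can be restated as data, which makes the two stringency axes (salt and temperature) explicit. The sketch below is illustrative: the class and function names are hypothetical, and the conversion factor assumes the conventional composition of 1X SSC (0.15 M NaCl with 0.015 M sodium citrate), which is not stated in the text itself.

```python
# Illustrative sketch only: names are hypothetical; 1X SSC composition is
# the conventional 0.15 M NaCl / 0.015 M sodium citrate (an assumption here).
from dataclasses import dataclass

NACL_MOLAR_PER_X = 0.15


@dataclass(frozen=True)
class Wash:
    label: str
    ssc_x: float    # SSC concentration, in multiples of 1X
    temp_c: float   # wash temperature in degrees Celsius

    @property
    def nacl_molar(self) -> float:
        return self.ssc_x * NACL_MOLAR_PER_X


# Salt series from the text: temperature fixed at 50 C, salt varied.
SALT_SERIES = [
    Wash("low", 5.0, 50.0),
    Wash("moderate", 2.0, 50.0),
    Wash("high", 0.2, 50.0),
]


def more_stringent(a: Wash, b: Wash) -> bool:
    """True if a is at least as stringent as b on both axes (lower or equal
    salt, higher or equal temperature) and strictly more on at least one."""
    no_worse = a.ssc_x <= b.ssc_x and a.temp_c >= b.temp_c
    strictly = a.ssc_x < b.ssc_x or a.temp_c > b.temp_c
    return no_worse and strictly


for wash in SALT_SERIES:
    print(f"{wash.label}: {wash.ssc_x}X SSC, about {wash.nacl_molar:.2f} M NaCl, {wash.temp_c} C")
```

Washes that differ in opposite directions on the two axes (for example lower salt but lower temperature) are not ordered by `more_stringent`, which mirrors the statement that either variable may be held constant while the other is changed.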

Alternatively, the labeled fragment may be used to screen a genomic library
derived from the organism of interest, again, using appropriately stringent conditions
well known to those of skill in the art.

Further, a homologous polynucleotide may be isolated from nucleic acid of the
organism of interest by performing PCR using two degenerate oligonucleotide primer
pools designed on the basis of amino acid sequences within the parent polynucleotide
as described by, e.g., Innis et al. (1990) and Wilkie et al. (1994). The template for the
reaction may be genomic DNA or cDNA obtained by reverse transcription of mRNA
prepared from, for example, human or non-human cell lines or tissue known or
suspected to express the polynucleotide.
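
The size of a degenerate primer pool follows directly from the number of codons available for each amino acid in the chosen peptide. The sketch below is an illustration under stated assumptions: it uses the standard genetic code, treats every codon as equally usable, and the peptide and helper names are hypothetical rather than taken from the cited references.

```python
# Illustrative sketch only: helper names and the example peptide are
# hypothetical. Codon counts are those of the standard genetic code.
from math import prod

CODONS_PER_AMINO_ACID = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def pool_degeneracy(peptide: str) -> int:
    """Number of distinct oligonucleotides needed to cover every codon
    combination that can encode the peptide."""
    return prod(CODONS_PER_AMINO_ACID[aa] for aa in peptide.upper())


def least_degenerate_window(protein: str, width: int):
    """Return (start index, peptide, degeneracy) for the window of the given
    width that requires the smallest primer pool."""
    windows = [(i, protein[i:i + width]) for i in range(len(protein) - width + 1)]
    start, peptide = min(windows, key=lambda w: pool_degeneracy(w[1]))
    return start, peptide, pool_degeneracy(peptide)


protein = "LSRMDWKHLS"  # hypothetical conserved region
print(pool_degeneracy("LSR"))                 # 216
print(least_degenerate_window(protein, 5))    # (3, 'MDWKH', 8)
```

Regions containing methionine or tryptophan, each encoded by a single codon, tend to give the smallest pools, whereas stretches of leucine, serine or arginine inflate the pool quickly.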

The PCR product may be subcloned and sequenced to ensure that the
amplified sequences represent homologous polynucleotides. The PCR fragment may
then be used to isolate a full length cDNA clone by a variety of methods. For example,
the amplified fragment may be labeled and used to screen a cDNA library, such as a
bacteriophage cDNA library. Alternatively, the labeled fragment may be used to
isolate genomic clones via the screening of a genomic library.

Homologous polynucleotides of the invention further include isolated 
polynucleotides which hybridize under highly stringent or moderately stringent
conditions to at least about 6, preferably about 12, more preferably about 18, 
consecutive nucleotides of the parent polynucleotide. The invention also includes 
polynucleotides, preferably DNA molecules, that hybridize to, and are therefore the 
complements of, the parent polynucleotide. Such hybridization conditions may be 
highly stringent or moderately stringent, as described above. In instances wherein the 
nucleic acid molecules are short oligonucleotides, highly stringent conditions may
refer, e.g., to washing in 6X SSC/50mM sodium pyrophosphate at 37°C (for 14-base 
oligos), 48°C (for 17-base oligos), 55°C (for 20-base oligos), and 60°C (for 23-base 
oligos). These nucleic acid molecules may encode or act as antisense molecules 
useful, for example, in gene regulation. Further, such sequences may be used as part 
of ribozyme and/or triple helix sequences, also useful for gene regulation. Still further, 
such molecules may be used as components of diagnostic methods whereby, for 
example, the presence of a particular allele or alternatively spliced transcript 
responsible for a mutant phenotype may be detected. 
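
The oligonucleotide wash temperatures quoted above can be kept as a small lookup table. The sketch below is only a convenience wrapper around the four values given in the text; the names `OLIGO_WASH_TEMP_C` and `wash_temperature_c` are hypothetical, and no interpolation is attempted for lengths the text does not list.

```python
# Wash temperatures (°C) stated in the text for washing short oligonucleotides
# in 6X SSC/50 mM sodium pyrophosphate, keyed by oligonucleotide length in bases.
OLIGO_WASH_TEMP_C = {14: 37, 17: 48, 20: 55, 23: 60}


def wash_temperature_c(oligo_length):
    """Return the stated wash temperature for an oligo of the given length.

    Only the four lengths named in the text are supported; other lengths
    raise ValueError rather than guessing a value.
    """
    try:
        return OLIGO_WASH_TEMP_C[oligo_length]
    except KeyError:
        raise ValueError(
            f"no wash temperature is stated for {oligo_length}-base oligos"
        ) from None


if __name__ == "__main__":
    for length in sorted(OLIGO_WASH_TEMP_C):
        print(f"{length}-base oligo: wash at {wash_temperature_c(length)} °C")
```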

PCR technology may be utilized to isolate full length cDNA sequences. For 
example, RNA may be isolated, following standard procedures, from an appropriate 
cellular or tissue source. A reverse transcription reaction may be performed on the 
RNA using an oligonucleotide primer specific for the most 5' end of the amplified
fragment for the priming of first strand synthesis. The resulting RNA/DNA hybrid 
may then be "tailed" with guanines using a standard terminal transferase reaction, the 
hybrid may be digested with RNAase H, and second strand synthesis may then be 
primed with a poly-C primer. Thus, cDNA sequences upstream of the amplified 
fragment may easily be isolated. For a review of cloning strategies which may be 
used, see, e.g., Sambrook et al. (1989) and Ausubel et al. (1997).

6.10.2 Expression analysis 

Quantitative and qualitative aspects of gene expression of polynucleotides 
sequenced according to this invention can also be assayed. For example, RNA from a 
cell type or tissue known, or suspected, to express a gene may be isolated and tested 
utilizing hybridization or PCR techniques. The isolated cells can be derived from cell
culture or from a patient. The analysis of cells taken from culture may be a necessary
step in the assessment of cells to be used as part of a cell-based gene therapy
technique or, alternatively, to test the effect of compounds on the expression of the
gene. Such analyses may reveal both quantitative and qualitative aspects of the
expression pattern of the gene, including activation or inactivation of gene expression
and presence of alternatively spliced transcripts.

In one embodiment of such a detection scheme, a cDNA molecule is
synthesized from an RNA molecule of interest (e.g., by reverse transcription of the
RNA molecule into cDNA). All or part of the resulting cDNA is then used as the
template for a nucleic acid amplification reaction, such as a PCR amplification
reaction, or the like.

For detection of the amplified product, the nucleic acid amplification may be
performed using radioactively or non-radioactively labeled nucleotides. Alternatively,
enough amplified product may be made such that the product may be visualized by
standard ethidium bromide staining or by utilizing any other suitable nucleic acid
staining method.

Such RT-PCR techniques can be utilized to detect differences in transcript size 
which may be due to normal or abnormal alternative splicing. Additionally, such 
techniques can be performed using standard techniques to detect quantitative
differences between levels of full length and/or alternatively spliced transcripts 
detected in normal individuals relative to those individuals exhibiting a phenotype of 
interest. 

In the case where detection of specific alternatively spliced species is desired, 
appropriate primers and/or hybridization probes can be used, such that, in the absence
of such sequence, no amplification would occur. Primers are chosen which will yield 
fragments of differing size depending on whether a particular exon is present or absent 
from the transcript being utilized. 
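
The size difference that such primers report can be worked through with a short sketch. It assumes, purely for illustration, that the primers anneal at the outer ends of the two exons flanking the exon in question, so that the product length is the sum of the exon lengths between them; the exon names and lengths are hypothetical.

```python
def amplicon_size(exon_lengths, included_exons):
    """Length of the RT-PCR product spanning `included_exons`.

    Assumes primers sit at the outer ends of the first and last exon listed,
    so the product is simply the sum of the included exon lengths.
    """
    return sum(exon_lengths[name] for name in included_exons)


def splice_variant_products(exon_lengths, exon_order, cassette_exon):
    """Return (size with the cassette exon, size without it)."""
    with_exon = amplicon_size(exon_lengths, exon_order)
    without_exon = amplicon_size(
        exon_lengths, [e for e in exon_order if e != cassette_exon]
    )
    return with_exon, without_exon


if __name__ == "__main__":
    # Hypothetical exon lengths in base pairs.
    lengths = {"E1": 100, "E2": 50, "E3": 120}
    present, absent = splice_variant_products(lengths, ["E1", "E2", "E3"], "E2")
    print(f"exon present: {present} bp; exon absent: {absent} bp")
```

The two products differ by exactly the length of the exon in question, which is what allows them to be resolved by size on a gel.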

As an alternative to amplification techniques, standard Northern analyses can 
be performed if a sufficient quantity of the appropriate cells can be obtained. Utilizing
such techniques, quantitative as well as size related differences between transcripts 
can also be detected. 

Additionally, it is possible to perform such gene expression assays "in situ",
i.e., directly upon tissue sections (fixed and/or frozen) of patient tissue obtained from
biopsies or resections, such that no nucleic acid purification is necessary. Nucleic acid
reagents such as those described in Section 6.1 may be used as probes and/or primers
for such in situ procedures (see, for example, Nuovo, 1992).

Gene expression may also be assayed "en masse" utilizing polynucleotide 
arrays (see, for example, Lockhardt, 1996; Schena et al., 1995; etc.). Sequence from
polynucleotides can be used to design arrays for synthesis or to determine the identity 
of the polynucleotides at any particular address in the array. 

Another method to assay gene expression en masse is simply to sequence
cDNA from a cell population by the massively parallel method described above. This
technique can be coupled with a second parallel method such as SAGE (serial analysis
of gene expression, Velculescu et al., 1995) to permit analysis of even the rarest
transcripts with a single sequencing reaction. cDNA from different sources (for
example diseased vs. normal tissue, cells with and without drug, tissue from different
developmental states, etc.) can be compared to determine the differentially expressed
genes (see, e.g., Kozian et al., 1999).
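
One simple way to compare sequence counts from two cDNA sources is sketched below. This is an assumption-laden illustration and not a method stated in the text: it normalizes raw counts to counts per million and reports a log2 ratio with a pseudocount, and the gene names, counts and function names are all hypothetical.

```python
import math


def counts_per_million(counts):
    """Scale raw per-gene sequence counts to counts per million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no counts")
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_changes(sample_a, sample_b, pseudocount=1.0):
    """Return log2(sample_b / sample_a) per gene after normalization.

    The pseudocount (an arbitrary choice for this sketch) avoids division
    by zero for genes seen in only one sample.
    """
    cpm_a = counts_per_million(sample_a)
    cpm_b = counts_per_million(sample_b)
    genes = sorted(set(cpm_a) | set(cpm_b))
    return {
        gene: math.log2(
            (cpm_b.get(gene, 0.0) + pseudocount)
            / (cpm_a.get(gene, 0.0) + pseudocount)
        )
        for gene in genes
    }


if __name__ == "__main__":
    # Hypothetical counts of sequenced cDNAs per gene in two tissues.
    normal = {"geneA": 50, "geneB": 50}
    diseased = {"geneA": 10, "geneB": 90}
    for gene, lfc in log2_fold_changes(normal, diseased).items():
        print(f"{gene}: log2 fold change = {lfc:+.2f}")
```

A positive value marks a gene that is relatively more abundant in the second sample; a real analysis would also need replicates and a statistical test.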

6.10.3 Screening assays for compounds that
modulate the activity of a gene
product

Screening assays may be designed to identify compounds capable of 
interacting with, e.g., binding to, a polypeptide or gene product that is sequenced and 
characterized as described above. Methods are well known in the art, see for example 
Wolff (1995), Foye et al. (1995), and Hansen et al. (1990). The following assays are
designed to identify: (i) compounds that bind to gene products; (ii) compounds that
bind to other intracellular proteins that interact with a gene product; (iii) compounds
that interfere with the interaction of a gene product with other intracellular proteins;
and (iv) compounds that modulate the activity of a gene (i.e., modulate the level of
gene expression and/or modulate the level of a gene product activity). Compounds
may include, but are not limited to, peptides such as, for example, soluble peptides,
and small organic or inorganic molecules. Methods for synthesizing compounds are
well known in the art. Combinatorial synthesis and other high throughput synthesis
methods as well as high throughput screening assays have been described, see for
example, Wolff (1995), Burnbaum et al. (1999), Parce et al. (1999), Chelsky et al.
(1999); Horlbeck (1999); Devlin (1997); Venton et al. (1998); Kirk et al. (1998); and
Still et al. (1996).

Assays additionally may be utilized which identify compounds that bind to
gene regulatory sequences (e.g., promoter sequences), see, e.g., Piatt (1994), which
may modulate the level of gene expression. Methods for the identification of such 
intracellular proteins are described below. 

Compounds identified via assays such as those described herein may be useful, 
for example, in elaborating the biological function of a gene product, and for
ameliorating symptoms of disease. It is to be noted that the invention includes 
methods to identify such pharmaceutical compositions pertaining to polynucleotides 
characterized according to the invention. Such pharmaceutical compositions can be 
formulated, for example, as discussed below. 

6.10.4 In vitro screening assays for
compounds that bind to a gene
product

In vitro systems may be designed to identify compounds capable of interacting 
with, e.g., binding to, a polypeptide that is sequenced and characterized according to
this invention. Compounds identified may be useful, for example, in modulating the
activity of wild type and/or mutant gene products, may be useful in elaborating the
biological function of a gene product, may be utilized in screens for identifying
compounds that disrupt normal gene product interactions, or may in themselves 
disrupt such interactions. 

The principle of the assays used to identify compounds that interact with a
gene product involves preparing a reaction mixture of the gene product and the test
compound under conditions and for a time sufficient to allow the two components to
interact, e.g., bind, thus forming a complex, which may be transient, that can be
removed and/or detected in the reaction mixture. These assays
can be conducted in a variety of ways. For example, one method to conduct such an
assay would involve anchoring a gene product or the test substance onto a solid phase
and detecting the gene product/test compound complexes anchored on the solid phase
at the end of the reaction. In one embodiment of such a method, the gene product may
be anchored onto a solid surface, and the test compound, which is not anchored, may 
be labeled, either directly or indirectly. 

In practice, microtiter plates may conveniently be utilized as the solid phase. 
The anchored component may be immobilized by non-covalent or covalent 
attachments. Non-covalent attachment may be accomplished by simply coating the 
solid surface with a solution of the protein and drying. Alternatively, an immobilized 
antibody, preferably a monoclonal antibody, specific for the protein to be immobilized 
may be used to anchor the protein to the solid surface. The surfaces may be prepared 
in advance and stored. 

In order to conduct the assay, the non-immobilized component is added to the 
coated surface containing the anchored component. After the reaction is complete, 
unreacted components are removed (e.g., by washing) under conditions such that any
complexes formed will remain immobilized on the solid surface. The detection of 
complexes anchored on the solid surface can be accomplished in a number of ways. 
Where the previously non-immobilized component is pre-labeled, the detection of 
label immobilized on the surface indicates that complexes were formed. Where the 
previously non-immobilized component is not pre-labeled, an indirect label can be 
used to detect complexes anchored on the surface; e.g., using a labeled antibody
specific for the previously non-immobilized component (the antibody, in turn, may be 
directly labeled or indirectly labeled with a labeled anti-Ig antibody). 

Alternatively, a reaction can be conducted in a liquid phase, the reaction 
products separated from unreacted components, and complexes detected; e.g., using
an immobilized antibody specific for the gene product or the test compound to anchor 
any complexes formed in solution, and a labeled antibody specific for the other 
component of the possible complex to detect anchored complexes. 

6.10.5 Rational design of compounds that
interact with a gene product

The 3-dimensional structure of a gene product can be determined empirically 
using techniques such as crystallography (see for example, McRee, 1999; Drenth, 
1999) and NMR (see for example, Cavanagh et al., 1996; Krishna et al., 1999). In
some cases, the structure can be predicted from the primary amino acid sequence by
homology comparisons to known structures.

Knowledge of the 3-dimensional structure permits rational design of 
compounds that may interact with and influence the activity of the gene product, see 
for example Veerapandian (1995); Martin (1989); Keseru et al. (1999); Weiner, D.B.
et al. (1994), and Weiner, D.B. et al. (1995).

6.10.6 Assays for intracellular proteins that
interact with a gene product

Any method suitable for detecting protein-protein interactions may be
employed to identify intracellular proteins that interact with a gene product
characterized according to the method of this invention. Among the traditional
methods which may be employed are co-immunoprecipitation, crosslinking and
co-purification through gradients or chromatographic columns. Utilizing procedures such
as these allows for the isolation of intracellular proteins which interact with gene
products. Once isolated, such an intracellular protein can be identified and can, in
turn, be used, in conjunction with standard techniques, to identify additional proteins
with which it interacts.

Additionally, methods may be employed which result in the simultaneous 
identification of genes which encode the intracellular protein interacting with the gene 
product. These methods include, for example, probing expression libraries with the 
labeled gene product, using the labeled protein in a manner similar to the well known
technique of antibody probing of λgt11 libraries.

One method which detects protein interactions in vivo, the two-hybrid system,
is described in detail for illustration only and not by way of limitation. One version of
this system has been described (Chien et al., 1991) and is commercially available from
Clontech (Palo Alto, Calif.).

Briefly, utilizing such a system, plasmids are constructed that encode two 
hybrid proteins: one consists of the DNA-binding domain of a transcription activator 
protein fused to the characterized gene product and the other consists of the 
transcription activator protein's activation domain fused to an unknown protein that is 
encoded by a cDNA which has been recombined into this plasmid as part of a cDNA
library. The DNA-binding domain fusion plasmid and the cDNA library are
transformed into a strain of the yeast Saccharomyces cerevisiae that contains a
reporter gene (e.g., HBS or lacZ) whose regulatory region contains the transcription
activator's binding site. Neither hybrid protein alone can activate transcription of the
reporter gene: the DNA-binding domain hybrid cannot because it does not provide
activation function and the activation domain hybrid cannot because it cannot localize
to the activator's binding sites. Interaction of the two hybrid proteins reconstitutes the
functional activator protein and results in expression of the reporter gene, which is
detected by an assay for the reporter gene product.
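
The logic of the two-hybrid readout described above reduces to a three-way conjunction, sketched here as a toy truth table. The function and argument names are hypothetical, and the model deliberately ignores biological complications such as self-activating baits.

```python
def reporter_expressed(has_dbd_hybrid, has_ad_hybrid, hybrids_interact):
    """Toy model of the two-hybrid readout.

    The reporter is transcribed only when the DNA-binding domain hybrid and
    the activation domain hybrid are both present and physically interact,
    which reconstitutes a functional activator at its binding site.
    """
    return has_dbd_hybrid and has_ad_hybrid and hybrids_interact


if __name__ == "__main__":
    cases = [
        ("DNA-binding domain hybrid alone", True, False, False),
        ("activation domain hybrid alone", False, True, False),
        ("both present, no interaction", True, True, False),
        ("both present, interacting", True, True, True),
    ]
    for label, dbd, ad, interact in cases:
        state = "ON" if reporter_expressed(dbd, ad, interact) else "off"
        print(f"{label}: reporter {state}")
```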

The two-hybrid system or related methodology may be used to screen 
activation domain libraries for proteins that interact with the "bait" gene product.
Total genomic or cDNA sequences are fused to the DNA encoding an activation
domain. This library and a plasmid encoding a hybrid of the bait gene product fused to
the DNA-binding domain are cotransformed into a yeast reporter strain, and the
resulting transformants are screened for those that express the reporter gene. Positive
colonies are purified and the library plasmids responsible for reporter gene expression
are isolated. DNA sequencing then is used to identify the proteins encoded by the
library plasmids.

For example, the bait gene product can be cloned into a vector such that it is 
translationally fused to the DNA encoding the DNA-binding domain of the GAL4
protein. A cDNA library of the cell line from which proteins that interact with the bait
gene product are to be detected can be made using methods routinely practiced in the
art. The cDNA fragments can be inserted into a vector such that they are
translationally fused to the transcriptional activation domain of GAL4. This library
can be co-transformed along with the bait gene-GAL4 fusion plasmid into a yeast
strain which contains a lacZ gene driven by a promoter which contains GAL4
activation sequence. A cDNA encoded protein, fused to GAL4 transcriptional
activation domain, that interacts with the bait gene product will reconstitute an active
GAL4 protein and thereby drive expression of the HIS3 gene. Colonies which express
HIS3 can be detected by their growth on petri dishes containing semi-solid agar based
media lacking histidine. The cDNA can then be purified from these strains, and used
to produce and isolate the bait gene-interacting protein using techniques routinely
practiced in the art.

6.10.7 Assays for compounds that interfere
with the interaction between a gene
product and an intracellular
macromolecule

A characterized gene product of the invention may, in vivo, interact with one
or more intracellular macromolecules, such as proteins. Such macromolecules may
include, but are not limited to, nucleic acid molecules and those proteins identified via 
methods such as those described above. For purposes of this discussion, such 
intracellular macromolecules are referred to herein as "interacting partners." 

Compounds that disrupt interactions in this way may be useful in regulating the
activity of the gene product, including mutant gene products. Such compounds may 
include, but are not limited to molecules such as peptides, and the like, as described 
above, which would be capable of gaining access to the intracellular gene product. 

The basic principle of the assay systems used to identify compounds that
interfere with the interaction between the gene product and its intracellular interacting
partner or partners involves preparing a reaction mixture containing the gene product, 
and the interacting partner under conditions and for a time sufficient to allow the two 
to interact and bind, thus forming a complex. In order to test a compound for 
inhibitory activity, the reaction mixture is prepared in the presence and absence of the
test compound. The test compound may be initially included in the reaction mixture,
or may be added at a time subsequent to the addition of the gene product and its 
intracellular interacting partner. Control reaction mixtures are incubated without the 
test compound or with a placebo. The formation of any complexes between the gene 
product and the intracellular interacting partner is then detected. The formation of a
complex in the control reaction, but not in the reaction mixture containing the test
compound, indicates that the compound interferes with the interaction of the gene
product and the interacting partner. Additionally, complex formation within reaction
mixtures containing the test compound and normal gene product may also be 
compared to complex formation within reaction mixtures containing the test
compound and a mutant gene product. This comparison may be important in those
cases wherein it is desirable to identify compounds that disrupt interactions of mutant 
but not normal gene products. 

The assay for compounds that interfere with the interaction of the gene
product and interacting partners can be conducted in a heterogeneous or homogeneous 
format. Heterogeneous assays involve anchoring either the gene product or the 
binding partner onto a solid phase and detecting complexes anchored on the solid 
phase at the end of the reaction. In homogeneous assays, the entire reaction is carried
out in a liquid phase. In either approach, the order of addition of reactants can be 
varied to obtain different information about the compounds being tested. For example, 
test compounds that interfere with the interaction between the gene product and the 
interacting partners, e.g., by competition, can be identified by conducting the reaction
in the presence of the test substance; i.e., by adding the test substance to the reaction
mixture prior to or simultaneously with the gene product and intracellular interacting
partner. Alternatively, test compounds that disrupt pre-formed complexes, e.g.,
compounds with higher binding constants that displace one of the components from
the complex, can be tested by adding the test compound to the reaction mixture after
complexes have been formed. The various formats are described briefly below.

In a heterogeneous assay system, either the gene product or the interacting 
partner is anchored onto a solid surface, while the non-anchored species is labeled,
either directly or indirectly. In practice, microtiter plates are conveniently utilized. The 
anchored species may be immobilized by non-covalent or covalent attachments.
Non-covalent attachment may be accomplished simply by coating the solid surface with a
solution of the gene product or interacting partner and drying. Alternatively, an 
immobilized antibody specific for the species to be anchored may be used to anchor 
the species to the solid surface. The surfaces may be prepared in advance and stored. 

To conduct the assay, the partner of the immobilized species is exposed to the 
coated surface with or without the test compound. After the reaction is complete,
unreacted components are removed (e.g., by washing) and any complexes formed will 
remain immobilized on the solid surface. The detection of complexes anchored on the 
solid surface can be accomplished in a number of ways. Where the non-immobilized 
species is pre-labeled, the detection of label immobilized on the surface indicates that 
complexes were formed. Where the non-immobilized species is not pre-labeled, an
indirect label can be used to detect complexes anchored on the surface; e.g., using a
labeled antibody specific for the initially non-immobilized species (the antibody, in
turn, may be directly labeled or indirectly labeled with a labeled anti-Ig antibody).
Depending upon the order of addition of reaction components, test compounds which
inhibit complex formation or which disrupt pre-formed complexes can be detected.

Alternatively, the reaction can be conducted in a liquid phase in the presence
or absence of the test compound, the reaction products separated from unreacted
components, and complexes detected; e.g., using an immobilized antibody specific for
one of the interacting components to anchor any complexes formed in solution, and a
labeled antibody specific for the other partner to detect anchored complexes. Again,
depending upon the order of addition of reactants to the liquid phase, test compounds
which inhibit complex formation or which disrupt pre-formed complexes can be identified.

In an alternate embodiment of the invention, a homogeneous assay can be
used. In this approach, a pre-formed complex of the gene product and the interacting
partner is prepared in which either the gene product or its interacting partner is
labeled, but the signal generated by the label is quenched due to complex formation
(see, e.g., Rubenstein et al. (1980), which utilizes this approach for immunoassays).
The addition of a test substance that competes with and displaces one of the species 
from the pre-formed complex will result in the generation of a signal above 
background. In this way, test substances which disrupt the gene product/intracellular 
interacting partner interaction can be identified. 

In a particular embodiment, the gene product can be prepared for
immobilization using recombinant DNA techniques described above. For example,
the gene product coding region can be fused to a glutathione-S-transferase (GST) gene
using a fusion vector, such as pGEX-5X-1, in such a manner that its interacting
activity is maintained in the resulting fusion protein. The intracellular interacting
partner can be purified and used to raise a monoclonal antibody, using methods
routinely practiced in the art and described above. This antibody can be labeled with
the radioactive isotope 125I, for example, by methods routinely practiced in the art. In a
heterogeneous assay, e.g., the GST-gene product fusion protein can be anchored to
glutathione-agarose beads. The intracellular interacting partner can then be added in
the presence or absence of the test compound in a manner that allows interaction, e.g.,
binding, to occur. At the end of the reaction period, unbound material can be washed
away, and the labeled monoclonal antibody can be added to the system and allowed to
bind to the complexed components. The interaction between the gene product and the
intracellular interacting partner can be detected by measuring the amount of
radioactivity that remains associated with the glutathione-agarose beads. A successful
inhibition of the interaction by the test compound will result in a decrease in measured
radioactivity.
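
The decrease in bead-associated radioactivity can be expressed as a percent inhibition relative to the control reaction. The sketch below shows one common way to do this arithmetic; the formula, the optional background subtraction and the example counts are assumptions made for illustration and are not specified in the text.

```python
def percent_inhibition(cpm_with_compound, cpm_control, cpm_background=0.0):
    """Percent inhibition of complex formation from bead-associated counts.

    cpm_control is the reading without test compound; cpm_background is an
    optional no-partner blank (an assumption of this sketch) subtracted from
    both readings before the ratio is taken.
    """
    control_signal = cpm_control - cpm_background
    if control_signal <= 0:
        raise ValueError("control signal must exceed background")
    test_signal = cpm_with_compound - cpm_background
    return 100.0 * (1.0 - test_signal / control_signal)


if __name__ == "__main__":
    # Hypothetical counts per minute retained on glutathione-agarose beads.
    value = percent_inhibition(
        cpm_with_compound=600, cpm_control=1100, cpm_background=100
    )
    print(f"inhibition: {value:.1f}%")
```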

Alternatively, the GST fusion protein and the intracellular interacting partner 
can be mixed together in liquid in the absence of the solid glutathione-agarose beads. 
The test compound can be added either during or after the species are allowed to 
interact. This mixture can then be added to the glutathione-agarose beads and unbound
material is washed away. Again the extent of inhibition of the gene product/interacting
partner interaction can be detected by adding the labeled antibody and measuring the 
radioactivity associated with the beads. 

In another embodiment of the invention, these same techniques can be 
employed using peptide fragments that correspond to the binding domains of the gene
product and/or the intracellular interacting partner, in place of one or both of the full
length proteins. Any number of methods routinely practiced in the art can be used to 
identify and isolate the binding sites. These methods include, but are not limited to, 
mutagenesis of the gene encoding one of the proteins and screening for disruption of 
binding in a co-immunoprecipitation assay. Compensating mutations in the gene
encoding the second species in the complex can then be selected. Sequence analysis of
the genes encoding the respective proteins will reveal the mutations that correspond to 
the region of the protein involved in interacting, e.g., binding. Alternatively, one 
protein can be anchored to a solid surface using methods described in this Section 
above, and allowed to interact with, e.g., bind to, its labeled interacting partner, which
has been treated with a proteolytic enzyme, such as trypsin. After washing, a short,
labeled peptide comprising the interacting, e.g., binding, domain may remain 
associated with the solid material, which can be isolated and identified by amino acid 
sequencing. Also, once the gene coding for the intracellular binding partner is 
obtained, short gene segments can be engineered to express peptide fragments of the
protein, which can then be tested for binding activity and purified or synthesized.

For example, and not by way of limitation, the gene product can be anchored 
to a solid material as described above in this Section by making a GST fusion protein
and allowing it to bind to glutathione agarose beads. The interactive intracellular
binding partner can be labeled with a radioactive isotope, such as 35S, and cleaved
with a proteolytic enzyme such as trypsin. Cleavage products can then be added to the
anchored GST fusion protein and allowed to bind. After washing away unbound
peptides, labeled bound material, representing the intracellular interacting partner
binding domain, can be eluted, purified, and analyzed for amino acid sequence by 
well-known methods. Peptides so identified can be produced synthetically or fused to 
appropriate facilitative proteins using recombinant DNA technology. 

In another embodiment, a two-hybrid screening assay could be used to identify
drugs that block the interaction between the gene product and an interacting partner
(see for example Vidal et al., 1999). This strategy would employ a two-hybrid
containing yeast strain whose growth on synthetic complete medium lacking
L-histidine is conditional on the physical interaction between the gene product and an
interacting partner. In one example of such an embodiment, the strain would be spread
in a thin lawn on a plate made of synthetic complete medium lacking L-histidine.
Filter disks containing test compounds would be applied to the plates. Most test 
compounds would not affect the interaction between the gene product and the 
interacting partner and consequently a confluent lawn of yeast would grow around the 
disks impregnated with such compounds. Test compounds that inhibit the interaction
would block growth of the yeast strain around the filter disks containing them, causing
zones of growth inhibition. Those compounds could then be tested against wild-type 
yeast to confirm that they are not simply fungistatic or fungicidal. Such an 
embodiment can also be performed in liquid culture, utilizing standard well known 
methods for measuring cell growth in culture. 

6.10.8 Assays for molecules that affect the
expression of a gene product

A variety of methods may be employed to influence the expression of a gene
that is sequenced and characterized according to the methods of this invention. The 
influence of compounds such as peptides and small molecules on gene expression
may be assayed by, for example, simple Northern analysis, hybridization of cDNA or
mRNA to oligonucleotide arrays (see, e.g., Fair et al., 1998; and Marton et al., 1998),
or global monitoring of gene expression with a reporter gene coupled to different
promoters as described by, e.g., Ashby et al. (1996).

Antisense and ribozyme methods can be effective in influencing the 
expression of one or a limited number of genes. Antisense approaches involve the 
design of oligonucleotides (either DNA or RNA) that are complementary to the gene 
mRNA. The antisense oligonucleotides will bind to the complementary gene mRNA
transcripts and prevent translation. Perfect complementarity, although preferred, is not 
required. 

Oligonucleotides that are complementary to the 5' end of the message, e.g., the
5' untranslated sequence up to and including the AUG initiation codon, should work
most efficiently at inhibiting translation. However, sequences complementary to the 3'
untranslated sequences of mRNAs have been shown to be effective at inhibiting
translation of mRNAs as well, see generally, Wagner (1994). Thus, oligonucleotides
complementary to either the 5'- or 3'- non-translated, non-coding regions of the gene
could be used in an antisense approach to inhibit translation of the endogenous gene
mRNA.

Oligonucleotides complementary to the 5' untranslated region of the mRNA 
should include the complement of the AUG start codon. Antisense oligonucleotides 
complementary to mRNA coding regions are less efficient inhibitors of translation but 
could be used in accordance with the invention. Whether designed to hybridize to the
5' region, 3' region, or coding region of target or pathway gene mRNA, antisense nucleic
acids should be at least six nucleotides in length, and are preferably oligonucleotides 
ranging from 6 to about 50 nucleotides in length. In specific aspects the 
oligonucleotide is at least 10 nucleotides, at least 17 nucleotides, at least 25 
nucleotides or at least 50 nucleotides. 
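
The design rules above (complementarity to the target mRNA, coverage of the AUG
start codon, and a length of 6 to about 50 nucleotides) can be illustrated with a
minimal Python sketch. The function names, the toy sequence and the choice to treat
the first AUG in the sequence as the initiation codon are assumptions made for
illustration only; they are not part of the described methods.

```python
# Illustrative sketch only: names and the toy sequence are hypothetical.
COMPLEMENT = {"A": "U", "C": "G", "G": "C", "U": "A"}


def antisense_rna(mrna, start, length):
    """Return the antisense RNA oligo (written 5'->3') for mrna[start:start+length]."""
    if not 6 <= length <= 50:
        raise ValueError("length outside the 6 to about 50 nucleotide range")
    target = mrna[start:start + length].upper()
    if len(target) < length:
        raise ValueError("target window runs past the end of the mRNA")
    # Reverse the target and complement each base to obtain the antisense strand.
    return "".join(COMPLEMENT[base] for base in reversed(target))


def covers_start_codon(mrna, start, length):
    """True if the window contains the first AUG (assumed here to be the start codon)."""
    aug = mrna.upper().find("AUG")
    return aug != -1 and start <= aug and aug + 3 <= start + length


mrna = "GGCACCAUGGCUUCAGGAUCC"  # toy 5' end of a message
oligo = antisense_rna(mrna, 0, 12)
print(oligo, covers_start_codon(mrna, 0, 12))
```

A real design would also screen candidate oligonucleotides against controls, as
discussed in the following paragraph.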

Regardless of the choice of target sequence, it is preferred that in vitro studies
are first performed to quantitate the ability of the antisense oligonucleotide to inhibit
gene expression. It is preferred that these studies utilize controls that distinguish
between antisense gene inhibition and nonspecific biological effects of
oligonucleotides. It is also preferred that these studies compare levels of the target
RNA or protein with those of an internal control RNA or protein. Additionally, it is
envisioned that results obtained using the antisense oligonucleotide are compared with
those obtained using a control oligonucleotide. It is preferred that the control
oligonucleotide is of approximately the same length as the test oligonucleotide and
that the nucleotide sequence of the control oligonucleotide differs from the antisense
sequence no more than is necessary to prevent specific hybridization to the target
sequence.

The oligonucleotides can be DNA or RNA or chimeric mixtures or derivatives
or modified versions thereof, single-stranded or double-stranded. The oligonucleotide
can be modified at the base moiety, sugar moiety, or phosphate backbone, for
example, to improve stability of the molecule, hybridization, etc. The oligonucleotide
may include other appended groups such as peptides (e.g., for targeting host cell
receptors in vivo), or agents facilitating transport across the cell membrane (see, e.g.,
Letsinger et al., 1989; Lemaitre et al., 1987; Tullis, 1990) or the blood-brain barrier
(see, e.g., Pardridge et al., 1989), hybridization-triggered cleavage agents (see, e.g.,
van der Krol et al., 1988) or intercalating agents (see, e.g., Zon, 1988). To this end, the
oligonucleotide may be conjugated to another molecule, e.g., a peptide,
hybridization-triggered cross-linking agent, transport agent, hybridization-triggered
cleavage agent, etc.

The antisense oligonucleotide may comprise at least one modified base moiety
which is selected from the group including but not limited to 5-fluorouracil,
5-bromouracil, 5-chlorouracil, 5-iodouracil, hypoxanthine, xanthine, 4-acetylcytosine,
5-(carboxyhydroxylmethyl) uracil, 5-carboxymethylaminomethyl-2-thiouridine,
5-carboxymethylaminomethyluracil, dihydrouracil, beta-D-galactosylqueosine, inosine,
N6-isopentenyladenine, 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine,
2-methyladenine, 2-methylguanine, 3-methylcytosine, 5-methylcytosine, N6-adenine,
7-methylguanine, 5-methylaminomethyluracil, 5-methoxyaminomethyl-2-thiouracil,
beta-D-mannosylqueosine, 5'-methoxycarboxymethyluracil, 5-methoxyuracil,
2-methylthio-N6-isopentenyladenine, uracil-5-oxyacetic acid (v), wybutoxosine,
pseudouracil, queosine, 2-thiocytosine, 5-methyl-2-thiouracil, 2-thiouracil,
4-thiouracil, 5-methyluracil, uracil-5-oxyacetic acid methylester, uracil-5-oxyacetic acid
(v), 5-methyl-2-thiouracil, 3-(3-amino-3-N-2-carboxypropyl) uracil, (acp3)w, and
2,6-diaminopurine.

The antisense oligonucleotide may also comprise at least one modified sugar
moiety selected from the group including but not limited to arabinose,
2-fluoroarabinose, xylulose, and hexose.

In yet another embodiment, the antisense oligonucleotide comprises at least
one modified phosphate backbone selected from the group consisting of a
phosphorothioate, a phosphorodithioate, a phosphoramidothioate, a phosphoramidate,
a phosphordiamidate, a methylphosphonate, an alkyl phosphotriester, and a formacetal
or analog thereof.

In yet another embodiment, the antisense oligonucleotide is an alpha-anomeric
oligonucleotide. An alpha-anomeric oligonucleotide forms specific double-stranded
hybrids with complementary RNA in which, contrary to the usual beta-units, the
strands run parallel to each other (Gautier et al., 1987). The oligonucleotide is a
2'-O-methylribonucleotide (Inoue et al., 1987a), or a chimeric RNA-DNA analogue
(Inoue et al., 1987b).

Oligonucleotides of the invention may be synthesized by standard methods
known in the art, e.g., by use of an automated DNA synthesizer (such as are
commercially available from Biosearch, Applied Biosystems, etc.). As examples,
phosphorothioate oligonucleotides may be synthesized by the method of Stein et al.
(1988), methylphosphonate oligonucleotides can be prepared by use of controlled pore
glass polymer supports (Sarin et al., 1988), etc.

The antisense molecules should be delivered to cells which express the gene in
vivo. A number of methods have been developed for delivering antisense DNA or
RNA to cells; e.g., antisense molecules can be injected directly into the tissue site, or
modified antisense molecules, designed to target the desired cells (e.g., antisense
linked to peptides or antibodies that specifically bind receptors or antigens expressed
on the target cell surface) can be administered systemically.

However, it is often difficult to achieve intracellular concentrations of the
antisense sufficient to suppress translation of endogenous mRNAs. Therefore a
preferred approach utilizes a recombinant DNA construct in which the antisense
oligonucleotide is placed under the control of a strong promoter. The use of such a
construct to transfect target cells will result in the transcription of sufficient amounts
of single stranded RNAs that will form complementary base pairs with the
endogenous gene transcripts and thereby prevent translation of the gene mRNA. For
example, a vector can be introduced in vivo such that it is taken up by a cell and
directs the transcription of an antisense RNA. Such a vector can remain episomal or
become chromosomally integrated, as long as it can be transcribed to produce the
desired antisense RNA. Such vectors can be constructed by recombinant DNA
technology methods standard in the art. Vectors can be plasmid, viral, or others
known in the art, used for replication and expression in cells. Expression of the
sequence encoding the antisense RNA can be by any promoter known in the art to act
in the appropriate cells. Such promoters can be inducible or constitutive. Such
promoters include but are not limited to: the SV40 early promoter region (Bernoist et
al., 1981), the promoter contained in the 3' long terminal repeat of Rous sarcoma virus
(Yamamoto et al., 1980), the herpes thymidine kinase promoter (Wagner et al., 1981),
the regulatory sequences of the metallothionein gene (Brinster et al., 1982), etc. Any
type of plasmid, cosmid, YAC or viral vector can be used to prepare the recombinant
DNA construct which can be introduced directly into the tissue site. Alternatively,
viral vectors can be used which selectively infect the desired cells.

Ribozymes are enzymatic RNA molecules capable of catalyzing the specific
cleavage of RNA (for a review see, for example, Rossi, 1994). The mechanism of
ribozyme action involves sequence-specific hybridization of the ribozyme molecule to
complementary target RNA, followed by an endonucleolytic cleavage. The composition
of ribozyme molecules must include one or more sequences complementary to the
target gene mRNA, and must include the well known catalytic sequence responsible
for mRNA cleavage. For this sequence, see Cech et al. (1992). As such, within the
scope of the invention are engineered hammerhead motif ribozyme molecules that
specifically and efficiently catalyze endonucleolytic cleavage of RNA sequences
encoding target gene proteins.

Ribozyme molecules designed to catalytically cleave the gene mRNA
transcripts can also be used to prevent translation of the gene mRNA and expression
of target or pathway gene (see, e.g., Cech et al., 1990; Sarver et al., 1990). While
ribozymes that cleave mRNA at site-specific recognition sequences can be used to
destroy the gene mRNAs, the use of hammerhead ribozymes is preferred.
Hammerhead ribozymes cleave mRNAs at locations dictated by flanking regions that
form complementary base pairs with the target mRNA. The sole requirement is that
the target mRNA have the following sequence of two bases: 5'-UG-3'. The
construction and production of hammerhead ribozymes is well known in the art and is
described more fully by Haseloff et al. (1988). Preferably the ribozyme is engineered
so that the cleavage recognition site is located near the 5' end of the gene mRNA, i.e.,
to increase efficiency and minimize the intracellular accumulation of non-functional
mRNA transcripts.
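
As a hedged illustration of the site requirement stated above, the following Python
sketch lists every 5'-UG-3' dinucleotide in a target mRNA, ordered from the 5' end
so that sites near the 5' end appear first. The function name and toy sequence are
hypothetical, and accepting DNA-style input by converting T to U is a convenience
added here, not something the text specifies.

```python
# Illustrative sketch only: the helper name and example sequence are hypothetical.
def ug_sites(mrna):
    """Return 0-based positions of every 5'-UG-3' dinucleotide, 5'-most first."""
    seq = mrna.upper().replace("T", "U")  # tolerate DNA-style input (assumption)
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "UG"]


sites = ug_sites("AUGCCUGAAUGG")
print(sites)          # candidate cleavage recognition sites
print(sites[0])       # the site closest to the 5' end
```

Choosing among the candidate sites would in practice also depend on the flanking
sequences available for base pairing, which this sketch does not evaluate.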

The ribozymes of the present invention also include RNA endoribonucleases
(hereinafter "Cech-type ribozymes") such as the one which occurs naturally in
Tetrahymena thermophila (known as the IVS, or L-19 IVS RNA) and which has been
extensively described by Cech and collaborators (see, e.g., Zaug et al., 1984; Zaug et
al., 1986a; Zaug et al., 1986b; Cech et al., 1991; Cech, 1986). The Cech-type
ribozymes have an eight base pair active site which hybridizes to a target RNA
sequence whereafter cleavage of the target RNA takes place. The invention
encompasses those Cech-type ribozymes which target eight base-pair active site
sequences that are present in the gene.

As in the antisense approach, the ribozymes can be composed of modified
oligonucleotides (e.g., for improved stability, targeting, etc.) and should be delivered to
cells which express the gene of interest in vivo. A preferred method of delivery
involves using a DNA construct "encoding" the ribozyme under the control of a
strong constitutive promoter, so that transfected cells will produce sufficient quantities
of the ribozyme to destroy endogenous gene messages and inhibit translation. Because
ribozymes, unlike antisense molecules, are catalytic, a lower intracellular concentration
is required for efficiency.

In instances wherein the antisense, ribozyme, and/or triple helix molecules
described herein are utilized to inhibit mutant gene expression, it is possible that the
technique can also efficiently reduce or inhibit the transcription (triple helix) and/or
translation (antisense, ribozyme) of mRNA produced by normal target gene alleles, so
that the concentration of normal target gene product present can be lower than is
necessary for a normal phenotype. In such cases, to ensure that substantially normal
levels of target gene activity are maintained, nucleic acid molecules that encode and
express target gene polypeptides exhibiting normal target gene activity, and that do
not contain sequences susceptible to whatever antisense, ribozyme, or triple helix
treatments are being utilized, can be introduced into cells via gene therapy methods.
Alternatively, in instances whereby the target gene encodes an extracellular protein, it
can be preferable to co-administer normal target gene protein in order to maintain the
requisite level of target gene activity.

Anti-sense RNA and DNA, ribozyme, and triple helix molecules of the 
invention can be prepared by any method known in the art for the synthesis of DNA 
and RNA molecules. These include techniques for chemically synthesizing 
oligodeoxyribonucleotides and oligoribonucleotides well known in the art such as for 
example solid phase phosphoramidite chemical synthesis. Alternatively, RNA 
molecules can be generated by in vitro and in vivo transcription of DNA sequences 
encoding the antisense RNA molecule. Such DNA sequences can be incorporated into 
a wide variety of vectors which incorporate suitable RNA polymerase promoters such 
as the T7 or SP6 polymerase promoters. Alternatively, antisense cDNA constructs that 
synthesize antisense RNA constitutively or inducibly, depending on the promoter 
used, can be introduced stably into cell lines. 

Various well-known modifications to the DNA molecules can be introduced as
a means of increasing intracellular stability and half-life. Possible modifications
include, but are not limited to, the addition of flanking sequences of ribo- or
deoxy-nucleotides to the 5' and/or 3' ends of the molecule or the use of
phosphorothioate or 2' O-methyl rather than phosphodiester linkages within the
oligodeoxyribonucleotide backbone.

Endogenous gene expression can also be reduced by specifically inactivating
or "knocking out" the target and/or pathway gene or its promoter using targeted
homologous recombination (see, e.g., Smithies et al., 1985; Thomas et al., 1987;
Thompson et al., 1989). For example, a mutant, non-functional gene (or a completely
unrelated DNA sequence) flanked by DNA homologous to the endogenous gene
(either the coding regions or regulatory regions of the gene) can be used, with or
without a selectable marker and/or a negative selectable marker, to transfect cells that
express the gene in vivo. Insertion of the DNA construct, via targeted homologous
recombination, results in inactivation of the gene. Such approaches are particularly
suited in the agricultural field where modifications to ES (embryonic stem) cells can
be used to generate animal offspring with an inactive gene (see, e.g., Thomas et al.,
1987 and Thompson et al., 1989). Such techniques can also be utilized to generate
immune disorder animal models. It should be noted that this approach can be adapted
for use in humans provided the recombinant DNA constructs are directly administered
or targeted to the required site in vivo using appropriate viral vectors, e.g., herpes
virus vectors. Targeted homologous recombination also is useful to introduce point
mutations into a gene or other small modifications that may alter the activity of a gene
product. Other methods of targeting specific changes to a gene make use of, for
example, small RNA/DNA hybrids (see Cole-Strauss et al., 1996; Ye et al., 1998).

Alternatively, endogenous gene expression can be reduced by targeting
deoxyribonucleotide sequences complementary to the regulatory region of the gene
(i.e., the gene promoter and/or enhancers) to form triple helical structures that prevent
transcription of the gene in target cells in the body (see generally Helene, C., 1991;
Helene et al., 1992; and Maher, 1992).

6.10.9 Assays for the biological activity of 
polypeptides 

The methods described above to assay a compound for interactions with a gene
product or effects on the biological function and/or expression of a gene product can
equally be used to assay polypeptides, polypeptide fragments (and analogs) encoded in
polynucleotides and homologs identified and characterized according to the parallel
methods of the invention. See Hider et al. (1991), Taylor et al. (1994), Goodman et
al. (1995), Osslund (1996). In addition, the polypeptides and analogs can be assayed
for biological (or pharmacological) activity in tissue culture or in an organism. See, for
example, Weissmann (1985), Jones et al. (1987), Lin (1987), Souza (1989, 1992),
Pierce et al. (1998), Stern, M.E. (1999), Samal (1999), Bachmaier et al. (1999), and
Tartaglia (1999).

6.10.10 In vitro evolution 

Another embodiment of this invention involves mutagenizing a sequenced
polynucleotide and assaying the encoded polypeptide for altered activity. A
polynucleotide that encodes a gene product isolated from a natural source can serve as
a template for subsequent modification and "improvement" of the gene product for
specific uses. Site-directed mutagenesis is well known in the art and has long been
used to modify the activity of a gene product (see, for example, Pictet, 1991; Kunkel,
1989; Chappel et al., 1993; Chaleff, 1994; Powers et al., 1998; Gehrke et al., 1994;
Yamashita et al., 1994; Harper et al., 1990; Zukowski et al., 1990). Random
mutagenesis followed by selection or screening protocols provides powerful methods
to alter the activity of a gene product (see, for example, Davis et al., 1980; Miller,
1972; Rose et al., 1990). More recently, techniques have been developed that couple
random mutagenesis and in vitro evolution to sample a greater variety of potentially
useful mutations than can reasonably be assayed by the more traditional techniques
mentioned above (see Stemmer, 1997; Buchholz et al., 1998; and Zhao et al., 1998).

6.10.11 Pharmaceutical preparations and methods of administration

The nucleic acid sequences, polypeptides and other compounds described 
above may have therapeutic value and may be administered to a patient at 
therapeutically effective doses to treat or ameliorate disease. A therapeutically 
effective dose refers to that amount of a compound sufficient to result in amelioration 
of the disease symptoms, or alternatively, to that amount of a nucleic acid sequence 
sufficient to modulate the expression of a gene product which results in the 
amelioration of the disease symptoms. 

6.10.11.1 Effective dose 

Toxicity and therapeutic efficacy of compounds can be determined by standard
pharmaceutical procedures in cell cultures or experimental animals, e.g., for
determining the LD50 (the dose lethal to 50% of the population) and the ED50 (the
dose therapeutically effective in 50% of the population). The dose ratio between toxic
and therapeutic effects is the therapeutic index and it can be expressed as the ratio
LD50/ED50. Compounds which exhibit large therapeutic indices are preferred. While
compounds that exhibit toxic side effects can be used, care should be taken to design a
delivery system that targets such compounds to the site of affected tissue in order to
minimize potential damage to uninfected cells and, thereby, reduce side effects.
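
The ratio defined above can be written as a short Python sketch. The function name
and the example doses are hypothetical values chosen only to show the arithmetic;
they are not data from this specification.

```python
# Illustrative sketch only: the function name and doses are hypothetical.
def therapeutic_index(ld50, ed50):
    """Therapeutic index expressed as the ratio LD50/ED50 (same dose units)."""
    if ld50 <= 0 or ed50 <= 0:
        raise ValueError("LD50 and ED50 must be positive doses")
    return ld50 / ed50


# A compound lethal to 50% of a population at 500 mg/kg and effective in
# 50% at 25 mg/kg has a therapeutic index of 20.
print(therapeutic_index(500.0, 25.0))
```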

The data obtained from the cell culture assays and animal studies can be used 
in formulating a range of dosage for use in humans. The dosage of such compounds 
lies preferably within a range of circulating concentrations that include the ED50 with 
little or no toxicity. The dosage can vary within this range depending upon the dosage 
form employed and the route of administration utilized. For any compound used in the 
method of the invention, the therapeutically effective dose can be estimated initially 
from cell culture assays. A dose can be formulated in animal models to achieve a
circulating plasma concentration range that includes the IC50 (i.e., the concentration of
the test compound which achieves a half-maximal inhibition of symptoms) as 
determined in cell culture. Such information can be used to more accurately determine 
useful doses in humans. Levels in plasma can be measured, for example, by high 
performance liquid chromatography. 

6.10.11.2 Formulations and use 
Pharmaceutical compositions for use in accordance with the present invention 
can be formulated in conventional manner using one or more physiologically 
acceptable carriers or excipients. 

Thus, the compounds and their physiologically acceptable salts and solvates
can be formulated for administration by inhalation or insufflation (either through the
mouth or the nose) or oral, buccal, parenteral or rectal administration.

For oral administration, the pharmaceutical compositions can take the form of,
for example, tablets or capsules prepared by conventional means with
pharmaceutically acceptable excipients such as binding agents (e.g., pre-gelatinized
maize starch, polyvinylpyrrolidone or hydroxypropyl methylcellulose); fillers (e.g.,
lactose, microcrystalline cellulose or calcium hydrogen phosphate); lubricants (e.g.,
magnesium stearate, talc or silica); disintegrants (e.g., potato starch or sodium starch
glycolate); or wetting agents (e.g., sodium lauryl sulphate). The tablets can be coated
by methods well known in the art. Liquid preparations for oral administration can take
the form of, for example, solutions, syrups or suspensions, or they can be presented as
a dry product for constitution with water or other suitable vehicle before use. Such
liquid preparations can be prepared by conventional means with pharmaceutically
acceptable additives such as suspending agents (e.g., sorbitol syrup, cellulose
derivatives or hydrogenated edible fats); emulsifying agents (e.g., lecithin or acacia);
non-aqueous vehicles (e.g., almond oil, oily esters, ethyl alcohol or fractionated
vegetable oils); and preservatives (e.g., methyl or propyl-p-hydroxybenzoates or
sorbic acid). The preparations can also contain buffer salts, flavoring, coloring and
sweetening agents as appropriate.

Preparations for oral administration can be suitably formulated to give
controlled release of the active compound. For buccal administration the compositions
can take the form of tablets or lozenges formulated in conventional manner.

For administration by inhalation, the compounds for use according to the
present invention are conveniently delivered in the form of an aerosol spray
presentation from pressurized packs or a nebulizer, with the use of a suitable
propellant, e.g., dichlorotetrafluoroethane, carbon dioxide or other suitable gas. In the
case of a pressurized aerosol the dosage unit can be determined by providing a valve to
deliver a metered amount. Capsules and cartridges of, e.g., gelatin for use in an inhaler
or insufflator can be formulated containing a powder mix of the compound and a
suitable powder base such as lactose or starch.

The compounds can be formulated for parenteral administration (i.e.,
intravenous or intramuscular) by injection, via, for example, bolus injection or
continuous infusion. Formulations for injection can be presented in unit dosage form,
e.g., in ampoules or in multi-dose containers, with an added preservative. The
compositions can take such forms as suspensions, solutions or emulsions in oily or
aqueous vehicles, and can contain formulatory agents such as suspending, stabilizing
and/or dispersing agents. Alternatively, the active ingredient can be in powder form
for constitution with a suitable vehicle, e.g., sterile pyrogen-free water, before use.

The compounds also can be formulated in rectal compositions such as
suppositories or retention enemas, e.g., containing conventional suppository bases
such as cocoa butter or other glycerides.

In addition to the formulations described previously, the compounds also can
be formulated as depot preparations. Such long acting formulations can be
administered by implantation (for example subcutaneously or intramuscularly) or by
intramuscular injection. Thus, for example, the compounds can be formulated with
suitable polymeric or hydrophobic materials (for example as an emulsion in an
acceptable oil) or ion exchange resins, or as sparingly soluble derivatives, for
example, as a sparingly soluble salt.

The compositions can, if desired, be presented in a pack or dispenser device
which can contain one or more unit dosage forms containing the active ingredient.
The pack can for example comprise metal or plastic foil, such as a blister pack. The
pack or dispenser device can be accompanied by instructions for administration.

6.10.12 Assays for polymorphisms

Polymorphisms represent differences in DNA sequence between members of
the same species. Polymorphisms include, for example, single nucleotide
polymorphisms (SNPs), variations in Short Tandem Repeats (STRs), Restriction
Fragment Length Polymorphisms (RFLPs), insertions, deletions and rearrangements.

Well developed methods exist in the art for assaying polymorphic and
phenotypic differences between individuals by genetic mapping to characterize the
genetic changes that give rise to phenotypic variation. These methods can be used, for
example, to discover mutations responsible for genetic disease, to manipulate and
breed useful traits in plants and animals, to discover elements in genetic pathways, to
diagnose propensity towards disease, to determine and diagnose drug response, etc.
(see, for example, Stone et al., 1999; Lebo et al., 1998; Giordano et al., 1998;
Rothschild et al., 1996; Blumenfeld et al., 1995; Meyer et al., 1997; Kamb, 1997;
Skolnick et al., 1997). Many phenotypic traits are multifactorial and methods are well
known for using polymorphisms to discover Quantitative Trait Loci (see, for example,
Webb et al., 1999; Helentjaris et al., 1995; Dupuis et al., 1999; Umari et al., 1996;
Lander et al., 1986 & 1989). STRs are highly polymorphic genetic elements and their
use in genetic mapping is well known in the art; see, for example, Caskey et al. (1994)
and Polymeropoulos (1995). The utility of SNPs for genetic mapping has recently
progressed considerably due to improvements in technology, in particular the ability
to assay many different SNPs simultaneously by using, for example, oligonucleotide
arrays. For representative examples see Nikiforov et al. (1999); Shuber, A.P. (1996);
Jakubowski et al. (1999); Cho et al. (1999); Brookes (1999); Kruglyak (1999);
Sapolsky et al. (1999); Xiong et al. (1999); Wang et al. (1998).

Polymorphisms are easily discovered by sequencing DNA from one or more
individuals according to the methods described above and comparing the sequences to
discover differences in homologous regions. Indeed, the sequencing invention can be
used both as a means to discover new polymorphisms and as a method to assay
polymorphisms for genetic mapping studies. Clearly any type of polymorphism can be
quickly assayed by sequencing the DNA (see Santamaria et al., 1997).
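
The comparison step described above can be sketched in Python for the simplest
case, single-base substitutions between two sequences that have already been
aligned to the same length. The function name and the toy sequences are
hypothetical, and alignment itself (needed to detect insertions, deletions and
rearrangements) is assumed to have been done beforehand.

```python
# Illustrative sketch only: assumes pre-aligned, equal-length sequences.
def find_substitutions(seq_a, seq_b):
    """Return (position, base_in_a, base_in_b) for every mismatching position."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = zip(seq_a.upper(), seq_b.upper())
    return [(i, a, b) for i, (a, b) in enumerate(pairs) if a != b]


individual_1 = "ACGTACGT"
individual_2 = "ACGTTCGT"
print(find_substitutions(individual_1, individual_2))
```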

To minimize the number of sequences needed to assay an individual, DNA
may be enriched for polymorphisms prior to the sequencing reaction. For example,
Ostrander et al. (1992) describe a method to enrich for STR sequences in a genomic
library. For other examples see Kandpal et al. (1994); Karagyozov et al. (1993) and
Paetkau (1999).

The physical mapping methods described above provide another means to
discover and assay polymorphic differences between individuals in a population. In
most cases, differences in the landmarks between individuals represent differences in
the nucleotide sequence. There are exceptions; for example, differences in methylation
patterns between individuals can be assayed by employing bisulfite-induced
modifications prior to cloning the DNA, whereby cytosine is converted to uracil, but
5-methylcytosine remains unchanged (see, e.g., Frommer et al., 1992).
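
The effect of the bisulfite treatment described above can be modeled with a short
Python sketch: unmethylated cytosines become uracil, while positions marked as
5-methylcytosine stay as C. The function name, the representation of methylation
as a set of 0-based positions and the toy sequence are assumptions made for
illustration.

```python
# Illustrative sketch only: methylation is modeled as a set of 0-based positions.
def bisulfite_convert(dna, methylated_positions=frozenset()):
    """Convert unmethylated C to U; leave 5-methylcytosine positions as C."""
    return "".join(
        "U" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(dna.upper())
    )


sequence = "ACGCCGTA"
print(bisulfite_convert(sequence))        # no methylation
print(bisulfite_convert(sequence, {3}))   # cytosine at position 3 methylated
```

Comparing the converted sequences from two individuals then exposes methylation
differences as apparent sequence differences, even where the underlying DNA
sequence is identical.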

A preferred landmark for assaying polymorphisms is the restriction site. For
example, assuming 0.1% of nucleotides are polymorphic between two homologous
chromosomes, then for any 6-base restriction site about 1 in 167 sites will be
polymorphic (i.e., one site will be cut by the restriction enzyme and one site cannot be
cut). If we assume a genome size of $3 \times 10^{9}$ base pairs, then we can expect about
$(3 \times 10^{9} / 4^{6}) / 167 \approx 4400$ polymorphic restriction sites per assayed 6-base
restriction enzyme. Of course, once polymorphisms are discovered, they can be assayed
in a population by other methods such as those mentioned above.
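
The estimate above can be reproduced with a few lines of Python. This sketch
assumes that the "1 in 167" figure arises from the chance that at least one of the
six bases in a site is polymorphic, 1 - (1 - 0.001)^6, and that sites occur once per
4^6 bases on average, as in the formula in the text. The function names are
hypothetical.

```python
# Illustrative sketch only: reproduces the arithmetic under the stated assumptions.
def polymorphic_site_fraction(per_base_rate, site_length):
    """Probability that at least one base of a site differs between homologs."""
    return 1.0 - (1.0 - per_base_rate) ** site_length


def expected_polymorphic_sites(genome_size, per_base_rate, site_length):
    """Expected number of polymorphic sites for one restriction enzyme."""
    total_sites = genome_size / 4 ** site_length  # one site per 4^n bases on average
    return total_sites * polymorphic_site_fraction(per_base_rate, site_length)


fraction = polymorphic_site_fraction(0.001, 6)
print(round(1 / fraction))                              # about 1 site in 167
print(round(expected_polymorphic_sites(3e9, 0.001, 6)))  # on the order of 4400
```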

6.10.13 Assays for genomic alterations within an individual

Changes can occur in the genome of an individual during the course of
development or during the progression of disease. The result is variation between
different populations of cells within the individual. This variation can be assayed
using the parallel methods of this invention.

Any change at the nucleotide level can be determined simply by sequencing
the genomic DNA from different populations of cells using the parallel methods
described above. For example, the sequence of DNA from cancerous tissue can be
compared to the sequence of DNA from nearby normal tissue. Changes in the genome
are known to occur during disease progression, and analysis at the sequence level can
help to pinpoint those changes that contribute to the diseased state.

Genomic rearrangements can readily be assayed at a lower resolution by
observing differences in the landmarks. In particular, changes in restriction site
patterns will occur not only as a result of single base changes, but also due to
rearrangements at both a fine and gross level. The physical mapping methods
described above typically yield information from a larger contiguous stretch of DNA
than the sequencing methods. Thus fewer clones need to be analyzed to quickly
"survey" the genomic DNA from, for example, a diseased tissue. Rearrangements
discovered in this manner may represent changes that contribute to the diseased state.

6.10.14 Transgenic organisms

Polynucleotides characterized according to this invention can be expressed in
transgenic multicellular organisms. Animals of any species, including, but not limited
to, mice, rats, rabbits, guinea pigs, pigs, micro-pigs, goats, and non-human primates,
e.g., baboons, monkeys, and chimpanzees, may be used to generate transgenic animals.
Other animal species may be used to create transgenic animals, such as Drosophila,
C. elegans, Xenopus, zebra fish, etc. Polynucleotides can also be inserted into the
genomes of a variety of plants and microorganisms to create transgenic organisms.

Any technique known in the art may be used to introduce a polynucleotide or
its associated gene into organisms to produce the founder lines of transgenic
organisms. Such techniques include, but are not limited to, pronuclear microinjection
(Wagner et al., 1989); retrovirus mediated gene transfer into germ lines (van der
Putten et al., 1985); gene targeting in embryonic stem cells (Thompson et al., 1989);
electroporation of embryos (Lo, 1983); sperm-mediated gene transfer (Lavitrano et al.,
1989; Perry et al., 1999); Agrobacterium tumefaciens mediated transformation (An et
al., 1988; Chee et al., 1992; Moloney et al., 1993), etc. For a review of animal
techniques, see Gordon (1989). Other examples include Lundquist et al. (1996),
Yoder et al. (1993), and Krzyzek et al. (1995).

The present invention provides for transgenic organisms that carry the
transgenes in all their cells, as well as organisms which carry the transgene in some,
but not all their cells, i.e., mosaics. The transgene may be integrated as a single
transgene or in concatamers, e.g., head-to-head tandems or head-to-tail tandems. The
transgene also may be selectively introduced into and activated in a particular cell type
by following, for example, the teaching of Lasko et al. (1992). The regulatory
sequences required for such a cell-type specific activation will depend upon the
particular cell type of interest, and will be apparent to those of skill in the art. When it
is desired that the transgene be integrated into the chromosomal site of the
endogenous gene, gene targeting is preferred. Briefly, when such a technique is to be
utilized, vectors containing some nucleotide sequences homologous to the endogenous
gene are designed for the purpose of integrating, via homologous recombination with
chromosomal sequences, into and disrupting or modifying the function of the
nucleotide sequence of the endogenous gene. The transgene also may be selectively
introduced into a particular cell type, thus inactivating the endogenous gene in only
that cell type, by following, for example, the teaching of Gu et al. (1994). The
regulatory sequences required for such a cell-type specific inactivation will depend
upon the particular cell type of interest, and will be apparent to those of skill in the art.

Methods for the production of single-copy transgenic organisms with chosen 
sites of integration are also well known to those of skill in the art. See, for example, 
Bronson et al. (1996) and Bradley et al. (1997).

Once transgenic organisms have been generated, the expression of the 
recombinant gene may be assayed utilizing standard techniques. Initial screening may
be accomplished by Southern blot analysis or PCR techniques to analyze animal
tissues to assay whether integration of the transgene has taken place. The level of
mRNA expression of the transgene in the tissues of the transgenic animals also may
be assessed using techniques which include but are not limited to Northern blot
analysis of tissue samples obtained from the animal, in situ hybridization analysis, and
RT-PCR. Samples of the gene-expressing tissue may also be evaluated 
immunocytochemically using antibodies specific for the transgene product. 

The methods described above for generating cells with insertion elements at 
known locations are well suited to the generation of transgenic organisms with 
insertion elements in their genomes. For example, the methods may be practiced on
mouse embryonic stem cells from which an adult animal can be cloned. Other
animals have been cloned from cell lines derived from embryos, see for example
Campbell et al. (1996), Chen et al. (1999), Hong et al. (1998), Baguisi et al. (1999),
and Cibelli et al. (1998).

Animals and cell lines with mapped insertion elements and transgenic animals
and cell lines made by other methods such as those described above have the potential
to model various human diseases (e.g., Robinson et al., 1996). In this context, the
animals or cell lines can serve as tools to test pharmaceuticals for efficacy in treating
the disease (see for example Cordell, 1995; Weinshilboum et al., 1995; Leder et al.,
1992; Hammer, 1996; Groffen et al., 1996; Terhorst et al., 1996; Donehower et al.,
1996; Lazzarini, 1997). The transgenic animal model systems may be used as a test
substrate to identify drugs, pharmaceuticals, therapies and interventions which may be 
effective in treating the disease or disorder of interest. Therapeutic agents may be 
administered systemically or locally. Suitable routes may include oral, rectal, or 
intestinal administration; parenteral delivery, including intramuscular, subcutaneous, 
intramedullary injections, as well as intrathecal, direct intraventricular, intravenous, 
intraperitoneal, intranasal, or intraocular injections, to name just a few. The response 
of the animals to the treatment may be monitored by assessing the reversal of the 
disease. With regard to intervention, any treatments which reverse any aspect of the 
disease should be considered as candidates for therapeutic intervention. Dosages of 
test agents may be determined by deriving dose-response curves. 

The transgenic animal model systems for a disease also may be used as test 
substrates to identify environmental factors, drugs, pharmaceuticals, and chemicals 
which may exacerbate the progression of the disease that the transgenic animals 
exhibit. 

In an alternate embodiment, the transgenic animal models for disease may be 
used to derive a cell line which may be used as a test substrate in culture, to identify 
both agents that reduce and agents that enhance the disease. While primary cultures 
derived from the transgenic animals of the invention may be utilized, the generation of 
continuous cell lines is preferred. For examples of techniques which may be used to 
derive a continuous cell line from the transgenic animals, see Small et al., 1985.

Insertion elements at known locations can serve as a starting point for 
subsequent targeted modifications to the genome. For example, the insertion element 
may carry a marker such as HSV-TK for which a negative selection exists (Capecchi
et al., 1996). Targeted modifications to the DNA surrounding the insertion element
can be generated in the cell by first modifying in vitro a subclone of the surrounding
DNA (absent the insertion element) using traditional recombinant methods,
transfecting the modified subclone into, for example, a cell line that carries the
insertion element, and selecting for loss of the HSV-TK marker. The end result is the
loss of the insertion element and the introduction of the targeted modification. The
insertion element may also carry a cleavage site for a rare-cutting enzyme such as
Sce I, in which case cotransfection with a plasmid encoding Sce I endonuclease may
lead to double-strand breaks and improved rates of targeted homologous
recombination (see, e.g., Dujon et al., 1999; and Smih et al., 1995).

6.10.15 Databases 
The sequences of polynucleotides determined by the methods described above 
can be stored in a database to facilitate analysis of the information. Methods for 
preparing a database of sequence information are well known in the art, see for
example Bilofsky et al. (1986), Benson et al. (1994), Doolittle (1990), and Sabatini et
al. (1999). Other databases can be created from the sequence information such as, for
example, a database of polymorphisms and a polypeptide database comprising
theoretical translations of polynucleotide sequences, see, e.g., Claverie et al. (1985) and
Stulich et al. (1989).
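As a minimal sketch of such a database, the following uses an in-memory SQLite table keyed by sample tag. The table name, column names and the two records are hypothetical placeholders chosen for illustration; they are not taken from the cited references.

```python
import sqlite3

# Hypothetical schema: one row per sample tag and the sequence read for it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reads (tag TEXT PRIMARY KEY, sequence TEXT NOT NULL)"
)

# Placeholder records, invented for illustration only.
rows = [
    ("TAG_A", "ATGGCCAAGT"),
    ("TAG_B", "ATGGCCTAGT"),
]
conn.executemany("INSERT INTO reads VALUES (?, ?)", rows)
conn.commit()


def lookup(tag):
    """Return the stored sequence for a tag, or None if the tag is absent."""
    hit = conn.execute(
        "SELECT sequence FROM reads WHERE tag = ?", (tag,)
    ).fetchone()
    return hit[0] if hit else None
```

Keying the table on the tag reflects the way the methods above associate each determined sequence with one sample tag; a polymorphism or translation table could reference the same key.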

6.11 Kits for Implementing the Method of the Invention

The invention includes kits for carrying out the various embodiments of the
invention. Preferably, kits of the invention include a set of primers and/or adapters for
carrying out the reactions and amplifications in accordance with the invention. Kits
also may include an array of tag complements attached to a solid phase support.
Additionally, kits of the invention may include sample tags or sample-tagged vectors.
Kits also may contain appropriate buffers for enzymatic processing, detection
chemistries, e.g., fluorescent components for labeling amplicons, instructions for use,
processing enzymes, such as ligases, polymerases, and so on. These and other aspects
of the invention are illustrated by the following non-limiting examples.

7. EXAMPLES

7.1 Example 1 

In this example, sequence information was obtained for a subset of cloned 
inserts from a pool of about 110 different cloned inserts. Sample-tagged vectors with
inserts were constructed in E. coli using standard techniques. Sample tags were
created by ligating complementary pairs of oligonucleotides into the unique Pvu II site
of the commercial vector pSP72 (Promega). Eleven different tags are shown below:

Tag1   CAGCACCAGGAAGGTGGCCAGGTTGGCAGTGTA (SEQ ID NO: 1)
Tag2   CCTAGCTCTCTTGAAGTCATCGGCCAGGGTGGA (SEQ ID NO: 2)
Tag3   ATCAAGCTTATGGATCCCGTCGACCT (SEQ ID NO: 3)
Tag4   GGTGCTCGTGTCTTTATCGTCCCTACGTCTCTT (SEQ ID NO: 4)
Tag5   AATTTTGAAGTTAGCTTTGATTCCATTC (SEQ ID NO: 5)
Tag6   GGCGTCCTGCTGCAGTCTGGCATTGGGGAA (SEQ ID NO: 6)
Tag7   ATTGAAGATGGAGGCGTTCAACTAGCA (SEQ ID NO: 7)
Tag8   GATGAACTATACAAGCTTATGTCCAGACTTCCA (SEQ ID NO: 8)
Tag9   AAGGGCAGATTGGTAGGACAGGTAATG (SEQ ID NO: 9)
Tag10  CCGTCGGGCATCCGCGCCTTGAG (SEQ ID NO: 10)
Tag11  TACATTGTGTGAGTTGAAGTTGTATTCCAATTT (SEQ ID NO: 11)

Inserts were cloned between the Bgl II and Xba I sites of pSP72. These inserts 
were derived from a complete restriction digest of rat genomic DNA with BamH I, 
Bgl II and Xba I. The relevant sequence of a Tag11 construct is shown below:

         1       10        20         30
...ATTTA GGTGACACTATAGAACTCGAGCAG TACATTGTGTGAGT
         SP72for>>

40        50         60        70         80
TGAAGTTGTATTCCAATTT CTGAAGCTTGCATGCCTG CAGGTCGACT
                    <<SP72rev

   90
CTAGA (SEQ ID NO: 12) ..INSERT.. GATCTGCCGGTCT (SEQ ID NO: 13)...

The tag (bases 25 through 57) is shown in bold lettering. Underlined sequences
(marked SP72for and SP72rev) represent oligonucleotides described below.
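The base numbering of the construct can be checked with a short sketch. It assumes that base 1 is the first base of SP72for (SEQ ID NO: 14), and it rebuilds the region from the primer, the Tag11 sequence, the reverse complement of SP72rev (SEQ ID NO: 15) and the remaining vector bases up to the insert. The variable names are illustrative only.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


SP72FOR = "GGTGACACTATAGAACTCGAGCAG"          # SEQ ID NO: 14
SP72REV = "CAGGCATGCAAGCTTCAG"                # SEQ ID NO: 15
TAG11 = "TACATTGTGTGAGTTGAAGTTGTATTCCAATTT"   # SEQ ID NO: 11
TAIL = "CAGGTCGACT" + "CTAGA"                 # vector bases up to the insert

# Assumption: numbering starts at the first base of SP72for.
construct = SP72FOR + TAG11 + revcomp(SP72REV) + TAIL

tag_start = construct.index(TAG11) + 1        # 1-based coordinate
tag_end = tag_start + len(TAG11) - 1
print(tag_start, tag_end, len(construct))
```

Under that assumption the vector-derived region ends at base 90, so the first base of the insert is base 91 and the eleventh band of the multiplex ladder described below falls at base 101.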

For constructs containing Tag1 through Tag10, a single random insert was
cloned and grown to saturation in liquid media. For constructs containing Tag11,
about 100 random inserts were cloned, pooled and grown to saturation in liquid
media. A pool of about 110 random inserts was made by diluting each single isolate
(i.e., constructs containing Tag1 through Tag10) into the pool of Tag11 constructs at a
ratio of 1:100. In this pool, with the exception of Tag11, each tag is associated with a
single, unique insert. This pool of about 110 constructs was grown further, and
plasmid DNA was prepared using a Qiagen midiprep kit.

The plasmid DNA (3 µg) was sequenced using a Sequenase kit (Amersham
Pharmacia Biotech) and primer SP72for (GGTGACACTATAGAACTCGAGCAG,
SEQ ID NO: 14). Note this primer sequences through the tag and into the insert.
35S-dATP was incorporated during the sequencing reaction. The labeled products were
separated in four lanes of a standard 6% polyacrylamide urea sequencing gel in 1X
TBE (89 mM Tris borate, pH 8.3/2 mM EDTA). The gel was dried onto Whatman
3MM paper.

The sequencing ladder was visualized by exposing the gel to film. The 
sequence of base 29 through base 90 was clearly visible (see FIG. 7a). This result is
expected since constructs containing Tag11 made up over 90% of the pool. After base
90, a uniform evenly-spaced ladder of over 100 bands was evident in all four lanes.
This multiplex ladder represents the superposition of the sequencing ladders from all
the clones in the pool.
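The pool composition can be worked through as simple arithmetic. The sketch assumes that the 1:100 ratio is measured against the whole Tag11 pool and that the roughly 100 Tag11 inserts are equally represented; both are reading assumptions, not statements from the example.

```python
n_single = 10        # isolates carrying Tag1 through Tag10
n_tag11 = 100        # approximate number of inserts in the Tag11 pool
ratio = 1 / 100      # each isolate : Tag11 pool

total = 1 + n_single * ratio            # the Tag11 pool counts as 1 part
tag11_fraction = 1 / total              # share of the pool carrying Tag11
single_fraction = ratio / total         # share held by each single isolate
per_tag11_insert = tag11_fraction / n_tag11

print(f"Tag11 constructs: {tag11_fraction:.1%}")
print(f"each Tag1-Tag10 isolate: {single_fraction:.2%}")
print(f"each Tag11 insert: {per_tag11_insert:.2%}")
```

With these assumptions Tag11 constructs make up about 91% of the pool, and every one of the roughly 110 inserts is present at a similar level, which is consistent with an evenly spaced multiplex ladder.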

The film was aligned with the dried gel. Using the multiplex ladder as a
marker, 10 adjacent sections were excised from each lane with a razor blade so that
adjacent edges were touching. Each section contained only one marker band, which
was situated in the middle of the section (see FIG. 7b). The first four sections (one
from each lane, taken from the bottom of the sectioned region of the gel) contained
bands at the eleventh position of the multiplex ladder, which corresponds to "base"
101 in the Tag11 construct shown above. The 40 sections (or fractions) were
separately placed into 100 µl H2O and heated to 70°C for 20 minutes. One microliter
of the eluted DNA was amplified in a 20 µl polymerase chain reaction with Taq
polymerase and PCR buffer according to the manufacturer's instructions (Promega).
Briefly, the primers SP72for (SEQ ID NO: 14) and SP72rev,
CAGGCATGCAAGCTTCAG (SEQ ID NO: 15), were used at 0.8 µM with 0.2 mM
dNTPs, 1.5 mM MgCl2, PCR buffer and polymerase. The PCR mixture was subjected
to the following cycle parameters: 94°C, 30 s; 55°C, 30 s; 72°C, 30 s; 35 cycles.

The PCR mixtures were treated with phosphatase to remove residual
unincorporated nucleotides. In a 10 µl reaction the following were combined: 3 µl PCR
mixture and 7 µl Shrimp Alkaline Phosphatase (United States Biochemical, diluted to
0.14 units/µl). The reactions were incubated at 37°C for 15 minutes and terminated at
80°C for 15 minutes. Then 2 µl of a solution of fresh primers (SP72for and SP72rev,
each at 2.4 µM) was added to each reaction. The resulting mixtures were heated to
100°C for 2 minutes and immediately placed on ice in preparation for labeling with
32P-dATP.

The labeling step was accomplished with Sequenase, Reaction Buffer and
Labeling Mix supplied by the manufacturer (Amersham Pharmacia Biotech). Briefly,
3 µl phosphatase-treated PCR mix was combined with 0.60 µl Reaction Buffer, 0.30 µl
0.1 M dithiothreitol, 0.12 µl Labeling Mix, 0.15 µl 32P-dATP (3000 Ci/mmol) and
1.1 µl Sequenase (diluted to 0.85 units/µl). Reactions were incubated for 10 minutes at
room temperature. Then 4 µl 0.2 mM dNTPs (in 10 mM Tris-HCl, 10 mM MgCl2, 50 mM
NaCl pH 7.9) was added to each reaction, followed by another 10 minute incubation at
room temperature. Reactions were terminated by the addition of 2 µl 100 mM EDTA.

Identification of the specific tags in each labeled PCR product was achieved by 
dot-blot hybridization. The oligonucleotides originally used to create the tagged 
constructs were employed again to make the dot-blots. However, small
oligonucleotides will not hybridize well once bound to nylon membranes. It was
necessary to "lengthen" the oligonucleotides before application to the membrane.
Using standard techniques, each pair of complementary oligonucleotides described
above was ligated to Hinc II digested pBR322. The resulting ligation mixture was
PCR amplified with two primers: one oligonucleotide from the complementary pair
and a second common oligonucleotide, CACTATCGACTACGCGATCA (SEQ ID
NO: 16). The sequence of the common oligonucleotide begins 320 bases upstream of
the Hinc II site at position 653 in pBR322. The resulting PCR product is simply a
fusion of the oligonucleotide tag to a 320 base fragment derived from pBR322. The
dot-blots were made with a 96-well Blotting Apparatus and Zeta-Probe membrane
according to the manufacturer's instructions (BioRad). About 5 to 10 ng of a
"lengthened" oligonucleotide tag was applied per spot.

Each labeled PCR product from above was hybridized to a membrane with 10
different spots. Each spot hybridizes to a different tag (Tag1 through Tag10). The
labeled PCR products were used directly without further purification. Hybridizations
were performed in 2 ml hybridization solution (0.5 M Na2HPO4 pH 7.2, 7% SDS
according to Zeta-Probe instructions) at 55°C for 20 hours in plastic bags. Four
30-minute washes were performed at 40°C.
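The readout of these dot-blots can be sketched as a base-calling step. For one tag, each of the 10 ladder positions has four spots, one per sequencing lane; the lane with the strongest spot gives the base at that position. The threshold, the data layout and the intensity values below are hypothetical and serve only to illustrate the logic.

```python
LANES = "ACGT"


def call_bases(signals, threshold):
    """Call one base per ladder position for a single tag.

    signals is a list with one dict per position, mapping lane letter to
    spot intensity. The strongest lane is called; positions where no spot
    reaches the threshold are reported as 'N'.
    """
    calls = []
    for per_lane in signals:
        lane = max(LANES, key=lambda b: per_lane.get(b, 0.0))
        calls.append(lane if per_lane.get(lane, 0.0) >= threshold else "N")
    return "".join(calls)


# Invented intensities for three positions of one tag.
example = [
    {"A": 0.1, "C": 5.2, "G": 0.3, "T": 0.2},
    {"A": 4.8, "C": 0.2, "G": 0.1, "T": 0.4},
    {"A": 0.2, "C": 0.3, "G": 0.2, "T": 0.3},
]
print(call_bases(example, threshold=1.0))
```

Reporting an ambiguous call when every spot is near background mirrors the behaviour described for weakly hybridizing tags.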



WO 00/24937 PCT/US99/25037


Autoradiography was performed on the hybridized dot-blots. The results are 
shown in FIG. 8. Each construct containing Tag1 through Tag9 was sequenced
separately by standard means. The expected sequence for constructs containing Tags
1 through 9 is shown adjacent to the autoradiograms. The hybridization signal
strength depends on the tag sequence. This variability can be minimized by
optimization of the tag sequence and hybridization conditions. Clearly, when the
signal strength is high, the hybridization pattern corresponds faithfully to the expected
sequence. As the hybridization signal approaches background levels, some bases
become ambiguous. Tag10 failed to produce a hybridization signal. The absence of
signal was likely due to differences between the actual Tm of this tag as compared to
Tag1 through Tag9. Note the readout from the array of 10 tag complements has been
rearranged in FIG. 8 to more clearly show the sequence of the inserts. The actual
readout was strips of 10 spots corresponding to the ten different tags.
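How much the ten hybridized tags differ in length and base composition can be tabulated with a short sketch. The Tm formula used here is a generic rule of thumb for oligonucleotides longer than about 13 bases; it is an assumption introduced for illustration, it ignores the phosphate/SDS hybridization buffer used above, and it does not by itself explain why Tag10 gave no signal.

```python
TAGS = {
    "Tag1": "CAGCACCAGGAAGGTGGCCAGGTTGGCAGTGTA",
    "Tag2": "CCTAGCTCTCTTGAAGTCATCGGCCAGGGTGGA",
    "Tag3": "ATCAAGCTTATGGATCCCGTCGACCT",
    "Tag4": "GGTGCTCGTGTCTTTATCGTCCCTACGTCTCTT",
    "Tag5": "AATTTTGAAGTTAGCTTTGATTCCATTC",
    "Tag6": "GGCGTCCTGCTGCAGTCTGGCATTGGGGAA",
    "Tag7": "ATTGAAGATGGAGGCGTTCAACTAGCA",
    "Tag8": "GATGAACTATACAAGCTTATGTCCAGACTTCCA",
    "Tag9": "AAGGGCAGATTGGTAGGACAGGTAATG",
    "Tag10": "CCGTCGGGCATCCGCGCCTTGAG",
}


def gc_fraction(seq):
    """Fraction of G and C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def rough_tm(seq):
    """Rule-of-thumb melting temperature in degrees C (assumed formula)."""
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


for name, seq in TAGS.items():
    print(f"{name:6s} len={len(seq):2d} "
          f"GC={gc_fraction(seq):.2f} Tm~{rough_tm(seq):.1f}")
```

In this set Tag10 is both the shortest tag and the one with the highest GC fraction, so it sits apart from the others on either measure; designing tags with matched length and composition is one way to reduce the signal variability noted above.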
7.2 Example 2

This example describes a strategy for simultaneously sequencing about 37,000
different templates. A collection of about 100,000 sequence-tagged vectors is
constructed from the commercially available bacteriophage vector M13mp18. Using
standard methods, the vector M13PL1 is constructed by modifying M13mp18
between the EcoR I and Hind III sites as shown:

     BstXI                BstXI     BamHI
GAATTCCATGTTGTTGGGGCGCGCCTCCATCAACGTGG ATCCATCGAGACGGTCCA
                                       TagL>>>>

EcoICRI                     PstI                    HindIII
GAGCTCAGTGGCGCATGCAATGCTCCAACTGC AGGTTAGCCATGGTTGCCCA AGCTT
                                             <<<TagR

(SEQ ID NO: 17)

A pool of 100,000 different oligonucleotides is synthesized 3'->5' on an ABI
model 394 DNA synthesizer by the "split and pool" approach described by Brenner
(1997b). The sequence "TGCA" is synthesized on 10 columns. A different 5 base
sequence is added to each of the 10 columns. The column packing material is removed
from each column, mixed together and repacked into the 10 columns. A different 5
base sequence is synthesized on each column. The split, synthesize and pool process is
repeated three more times. The different sequences synthesized at each step are shown
in Table 1 (sequences are shown 5'->3').

TABLE 1

column   step 5   step 4   step 3   step 2   step 1
1        CTACT    CAGTC    TGTAG    TGACA    GAGCA
2        GAACT    GTGTC    ACTAG    AGACT    CTGCA
3        GTTCT    GACTC    AGAAG    TCACT    CACCA
4        GTAGT    GAGAC    AGTTG    TGTCT    CAGGA
5        GTACA    GAGTG    AGTAC    TGAGT    CAGCT
6        GATGA    GTCAG    ACATC    TCTGA    CTCGT
7        CTTGA    CACAG    TGATC    AGTGA    GACGT
8        CAAGA    CTGAG    TCTTC    ACAGA    GTGGT
9        CATCA    CTCTG    TCAAC    ACTGT    GTCCT
10       CATGT    CTCAC    TCATG    ACTCA    GTCGA
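The split and pool scheme can be enumerated directly. Five steps with ten choices each give 10^5 = 100,000 distinct 25-base variable regions. The sketch assumes that, read 5'->3', the blocks appear in the order step 5 through step 1 (step 1 being added first during 3'->5' synthesis, next to the initial "TGCA"); that ordering is an inference from the table layout rather than an explicit statement.

```python
from itertools import product

# Table 1: one list per synthesis step, columns 1 through 10.
STEPS = {
    5: ["CTACT", "GAACT", "GTTCT", "GTAGT", "GTACA",
        "GATGA", "CTTGA", "CAAGA", "CATCA", "CATGT"],
    4: ["CAGTC", "GTGTC", "GACTC", "GAGAC", "GAGTG",
        "GTCAG", "CACAG", "CTGAG", "CTCTG", "CTCAC"],
    3: ["TGTAG", "ACTAG", "AGAAG", "AGTTG", "AGTAC",
        "ACATC", "TGATC", "TCTTC", "TCAAC", "TCATG"],
    2: ["TGACA", "AGACT", "TCACT", "TGTCT", "TGAGT",
        "TCTGA", "AGTGA", "ACAGA", "ACTGT", "ACTCA"],
    1: ["GAGCA", "CTGCA", "CACCA", "CAGGA", "CAGCT",
        "CTCGT", "GACGT", "GTGGT", "GTCCT", "GTCGA"],
}


def all_tags():
    """Yield every 25-base variable region allowed by Table 1.

    Assumption: blocks are concatenated 5'->3' as step 5 ... step 1.
    """
    for blocks in product(*(STEPS[s] for s in (5, 4, 3, 2, 1))):
        yield "".join(blocks)


tags = list(all_tags())
print(len(tags), len(tags[0]))
```

Because the pooled resin is redistributed at random before each step, every combination of one block per step is expected to be present in the final mixture.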



The oligonucleotides are removed from the column, deprotected and 
concentrated. M13PL1 is cut with Pst I and EcoICR I. A 100-fold molar excess of the
oligonucleotides is ligated to the cut vector. Excess oligonucleotides are removed
using a Qiaquick kit (Qiagen). The vector/oligonucleotide is "filled in" with Klenow
fragment (3'->5' exo-, New England Biolabs), the reaction products are circularized
with ligase, and transformed into highly competent XL1-Blue (Stratagene). About 10
million transfectants are combined to make the sample-tagged vector pool.
Double-stranded (RF1) DNA is prepared from the pool with the Qiagen Plasmid
Purification System (Qiagen).

A mouse genomic library is prepared in the pool of sample-tagged phage 
vectors. Mouse DNA from strain 129/Sv is sheared to a fragment size of 3-6 kb using 
the Hydroshear (Genomic Instrumentation Services; San Carlos, CA; see Oefner et al.,
1998), according to the manufacturer's instructions. The sheared DNA is ligated to an
adapter made by annealing the following two phosphorylated oligonucleotides:

TGAGTCACCAAC (SEQ ID NO: 18)
GTGACTCA

The ligation products are separated on a 1% agarose gel in 1X TAE and 2-3 kb
fragments are cut from the gel. The fragments are purified with a Qiaquick Column
(Qiagen) according to the manufacturer's instructions.

The fragments are ligated into the pool of sample-tagged vectors prepared 
above. The resulting library is electroporated into the strain XL1-Blue (Stratagene)
and spread onto LB agar plates. About 100,000 transfectants are pooled by eluting the
phage from the agar plates into a solution of LB. The phage titer is increased by
subsequent growth in liquid LB. A 0.1 ml overnight culture of XL1-Blue is combined
with phage at a multiplicity of infection around 10, diluted into 10 ml LB and grown
at 37°C to saturation. Phage are separated from the cells by centrifugation and the
single-stranded phage DNA is purified with the Qiaprep M13 System (Qiagen)
according to the manufacturer's instructions.
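One reading that is consistent with the figure of about 37,000 templates is that only clones carrying a tag shared with no other clone can be read unambiguously. The sketch below estimates that number under the assumption that each clone receives one of the 100,000 tags uniformly and independently; the function name and the model are illustrative, not part of the example.

```python
def expected_unique(n_tags, n_clones):
    """Expected number of clones whose tag is carried by no other clone.

    Assumes each clone draws its tag uniformly and independently, so a
    given clone is unique with probability (1 - 1/n_tags)**(n_clones - 1).
    """
    return n_clones * (1.0 - 1.0 / n_tags) ** (n_clones - 1)


seq_unique = expected_unique(100_000, 100_000)   # library of Example 2
ins_unique = expected_unique(100_000, 21_952)    # colonies picked in Example 4
print(round(seq_unique), round(ins_unique))
```

Under this model roughly 100,000/e of the 100,000 pooled clones are uniquely tagged, and the same estimate applied to the 21,952 colonies of Example 4 gives a value close to the "about 17,600" insertion elements quoted there.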

Ten sequencing standard templates are prepared by cloning a random fragment
of the mouse DNA into each of 10 vectors. Each vector is identical to the
sample-tagged vectors described above except the 25 base distinct region is replaced
with the following sequences:

TCAATCGACTACACTCGTAACAAGA SEQ ID NO: 19
GATCAATTCGCTAATCGATCGTATA SEQ ID NO: 20
AAATAGATCGCATAAGCAGTACGTG SEQ ID NO: 21
TCATAGGCTGACAGTCCTAGCTAGT SEQ ID NO: 22
TCGTAGACAGTACATGTCGATGAAT SEQ ID NO: 23
TAACCGATCTAGTCGATCTACGACT SEQ ID NO: 24
GTTTCGAGCTAGCTAAGAGACTCGT SEQ ID NO: 25
CGTATTTCGACTGACTAGCCTCTAG SEQ ID NO: 26
AGTTCGATCAGCTAACTCTGAGTCA SEQ ID NO: 27
GCTATATCGATCGTCCATTAACGTA SEQ ID NO: 28

Each fragment is separately sequenced with the primer TagR,
GGGCAACCATGGCTAACC (SEQ ID NO: 29), by standard means. The ten standard
phage are grown separately in liquid as above and equal numbers of phage are pooled.
Single-stranded DNA is prepared from the pooled standards as above.

The pool of sample-tagged phage DNA (3 ng) is combined with the pool of
standard phage DNA (0.3 ng). The combined pool is sequenced using a Sequenase kit
(Amersham Pharmacia Biotech) and the primer TagR. Unlabeled dATP is substituted
for 32P-dATP in the manufacturer's protocol since these sequencing ladders will not be
directly visualized after electrophoresis. The result is four collections of tagged
termination products corresponding to terminal A, C, G, and T.

A size standard is made by sequencing a separate aliquot of the phage DNA
with a second primer M13gl, CTGAATCTTACCAACGCTAAC (SEQ ID NO: 30).
This primer anneals to a sequence element in gene I far from the sample inserts, so it
will produce an identical sequencing ladder for all the phage in the pool. This time
32P-dATP is incorporated in the sequencing reaction products. The four separate
termination reactions are pooled in a 1:1:1:2 ratio (A:T:C:G). The excess "G" reaction
simplifies alignment of different lanes after electrophoresis. 1 µl of this labeled size
standard is added to 3 µl of each collection of tagged termination products.

The four collections are electrophoresed at 40 V/cm in a standard 7 M urea,
0.5X TBE, 0.4 mm thick sequencing gel with 0.5 cm lanes. The gel is dried onto
Whatman 3MM paper. The size standard is visualized by autoradiography. The film is
aligned with the dried gel and individual bands are excised as described in Example 1.
The tagged reaction products in each gel slice (fraction) are electroeluted into a
volume of 50 µl with the Electroelutor (Amika Corp, Columbia, MD and see Shukla,
1994).

The tagged reactions in each fraction are PCR amplified with two
oligonucleotides: TagL, ATCCATCGAGACGGTCCA (SEQ ID NO: 31), and
TagR+biotin. TagR+biotin is identical in sequence to TagR and it is conjugated to
biotin at the 5' end during oligonucleotide synthesis using the LC Biotin-ON
Phosphoramidite (Clontech). 5 µl from each fraction is amplified in a 100 µl reaction
with Taq polymerase (Promega) and PCR buffer according to the manufacturer's
directions. Briefly, the primers are used at 1 µM with 0.2 mM dNTPs, 1.5 mM
MgCl2, PCR buffer and polymerase. The cycling parameters are as follows: 94°C,
30 s; 55°C, 30 s; 72°C, 30 s; 40 cycles. Prior to hybridization to the arrays, the PCR
samples are denatured at 96°C for 5 min and cooled on ice for 5 min.

Arrays of the 100,000 oligonucleotides (single-stranded) are synthesized with
parallel light-directed chemistry (Affymetrix, Santa Clara, CA has a custom array
service). For details see Fodor et al. (1991 & 1995); Pease et al. (1994). Current
technology allows fabrication of about 320,000 distinct oligonucleotides on a 1.28 cm
x 1.28 cm array; each oligonucleotide is present at about 10^7 copies in a 20 µm x 25 µm
"spot" (Wang et al., 1998). The oligonucleotides are identical in sequence to the
combinations allowed in Table 1 (read 5' to 3'). An additional 10 oligonucleotides are
synthesized that correspond in sequence to the 10 standard sample tags.
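The quoted array capacity follows from the stated dimensions, as the short calculation below shows. It assumes the spots tile the array in a simple rectangular grid with no gaps, which is a simplification.

```python
array_side_um = 12_800            # 1.28 cm expressed in micrometres
spot_w_um, spot_h_um = 20, 25     # spot dimensions from the text

# Assumption: spots tile the array edge to edge in a rectangular grid.
spots = (array_side_um // spot_w_um) * (array_side_um // spot_h_um)
needed = 10**5 + 10               # tag complements plus the 10 standards

print(spots, needed, needed <= spots)
```

The grid holds 640 x 512 = 327,680 spots, in line with "about 320,000", so the 100,010 oligonucleotides required here occupy less than a third of one array.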

The arrays are hybridized with 6X SSPET (0.9 M NaCl, 60 mM NaH2PO4 pH
7.4, 6 mM EDTA, 0.005% Triton X-100) for 5 minutes. 100 µl of 2X hybridization
buffer (2X = 6 M tetramethylammonium chloride, 20 mM Tris-HCl pH 7.8, 2 mM
EDTA, 0.02% Triton X-100 with 200 µg/ml sonicated herring sperm DNA (Promega))
is added to each denatured PCR sample from above for a final volume of 200 µl. Each
sample (fraction) is hybridized to one array for 15 hours at 44°C in a hybridization
chamber (Affymetrix) on a rotisserie at 40 rpm. The arrays are washed three times
with 1X SSPET and 10 times with 6X SSPET at 22°C. The hybridized biotinylated
amplicons are then stained at room temperature with staining solution (streptavidin
R-phycoerythrin (2 µg/ml, from Molecular Probes) and acetylated bovine serum albumin
(0.5 mg/ml) in 6X SSPET) for 8 minutes, followed by 10 washes with 6X SSPET at
22°C on a fluidics workstation (Affymetrix). The arrays are visualized with a confocal
chip scanner (Hewlett-Packard/Affymetrix) with a 560 nm filter.

The digitized signals from each array are compared, and any array-to-array
hybridization variability is corrected by reference to the 10 known standard sequences.
Sequence ladders are reconstructed from the hybridization patterns.
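One way to apply the standards is sketched below: each array is rescaled so that the standards expected to be present in that fraction have a median signal of 1. The choice of the median, the function name and the example intensities are assumptions made for illustration; the example does not specify a particular normalization.

```python
from statistics import median


def normalize(array_signals, positive_standards):
    """Scale one array so its positive standards have a median signal of 1.0.

    array_signals maps each tag to its raw intensity on this array.
    positive_standards lists the standard tags expected to give a signal
    in this fraction, which is known because the standards were sequenced
    separately.
    """
    ref = median(array_signals[t] for t in positive_standards)
    if ref <= 0:
        raise ValueError("standards gave no usable signal")
    return {tag: value / ref for tag, value in array_signals.items()}


# Invented intensities for one array.
raw = {"std1": 200.0, "std2": 100.0, "std3": 400.0, "tagX": 50.0}
norm = normalize(raw, ["std1", "std2", "std3"])
print(norm["tagX"])
```

After this rescaling, a single threshold can be applied across arrays when deciding which tags are present in each fraction.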
7.3 Example 3 

This example describes a method for simultaneously generating about 37,000
restriction maps.

A pool of about 100,000 sample-tagged fosmid vectors is prepared by PCR
amplifying the sample tags from the pool of phage vectors in Example 2 and cloning
the collection into the fosmid vector pFOS1 (Kim, U.J. et al., Nucl. Acids Res.
20:1083-85 (1992)). DNA from the phage pool is amplified with two primers, TagR
and CAACGTGGATCCATCGAGA (SEQ ID NO: 32), in a PCR reaction using Pfu
polymerase (Stratagene) according to the manufacturer's instructions. The resulting
amplicons comprise TagR, the variable sequences and TagL plus the BamH I site
shown above. The amplicons are cut with BamH I and pFOS1 is cut with BamH I and
Srf I. The vector and sample tags are joined by ligation and transformed into the
bacterial strain pop2136 (Kim et al., 1992) by electroporation. Note both restriction
sites are restored in the vector after ligation to the sample tags. About 10 million
transformants are pooled and plasmid DNA is prepared with the Qiagen Plasmid
Purification System (Qiagen). The plasmid DNA is prepared for cloning genomic
DNA as described by Kim et al. (1992). The plasmid pool is linearized with Aat II,
dephosphorylated with Alkaline Phosphatase and then digested with BamH I.
Similarly, the ten standard sample tags described in Example 2 are separately cloned
into pFOS1 and plasmid DNA is prepared for cloning as above.

A library is constructed in the pool of sample-tagged fosmid vectors. High 
molecular weight mouse DNA is partially digested with Mbo I to an average size of
40 kb, treated with alkaline phosphatase, and ligated to the vector DNA prepared
above as described by Kim et al. (1992). The ligation mixture is packaged into
lambda phage heads using the Gigapack III XL packaging extract (Stratagene). The
packaged clones are transfected into strain DH5α-MCR (Gibco BRL). About 100,000
clones are pooled and grown as a liquid culture in LB media. Plasmid DNA is purified
from the pool with the Qiagen Large Construct Kit (Qiagen) according to the
manufacturer's instructions. Similarly, a random genomic fragment is cloned into each
standard sample-tagged vector, the 10 standards are grown, plasmid DNA is purified
as above and equal amounts of each standard are combined to make a standard pool.

The pool of sample-tagged DNA is combined with the pool of standards
(10,000:1 mass ratio). The pooled DNA is linearized with Srf I and divided into four
aliquots. A different double-stranded adapter is ligated to DNA in each aliquot.
Excess adapters and salts are removed from the ligation reactions by electrodialysis
with the electroelutor (Amika Corp., Columbia, MD). The adapter sequences are
shown below:

5'-GCTCATTGCGGTAGCATACC      Adap1   SEQ ID NO: 33
             CATCGTATGG-5'           SEQ ID NO: 34

5'-GCGTGGCCTACTACGATTGT      Adap2   SEQ ID NO: 35
             GATGCTAACT-5'           SEQ ID NO: 36

5'-GACGTAGCGAACTAGGGCAG      Adap3   SEQ ID NO: 37
             TGATCCCGTC-5'           SEQ ID NO: 38

5'-GCAAGCAGCCTACGCATTAT      Adap4   SEQ ID NO: 39
             ATGCGTAATA-5'           SEQ ID NO: 40

Each aliquot is subjected to partial restriction analysis with a different enzyme 
(EcoR I, Xba I, Nsi I or Bgl II). The partial digestion reaction conditions are first 
calibrated as follows. One of the ten standard clones is digested with Not I and
end-labeled with 32P-dGTP by standard means (see Ausubel et al., 1997). The labeled
standard is digested with different concentrations of each enzyme. 10 µg of the
sample-tagged/standard DNA pool is linearized with Srf I, combined with about 10 ng
of the end-labeled standard and incubated with the different enzyme concentrations at
37°C for 15 minutes. The products are analyzed by agarose gel electrophoresis and
visualized by autoradiography. The enzyme concentration is chosen that produces the
most uniform distribution of fragments from the labeled standard. Now the
appropriate enzyme concentration is used to partially digest the four pools of
sample-tagged clones with adapters.

The four partial digests are pooled and run in a single lane of a 32 cm, 0.8% 
agarose gel. The separated products are collected during electrophoresis onto anion
exchange paper (NA-45, Schleicher & Schuell) using the GATC 1500 Direct Blotting
Electrophoresis System (GATC GmbH; Konstanz, Germany) as described (Beck,
1993). The paper is pulled along the bottom edge of the gel during electrophoresis at a
constant speed of 10 cm/hr, and the voltage is adjusted so the largest fragments elute
from the bottom of the gel after 6 hours. The 10 standard clones are analyzed
separately to determine their partial digest patterns with the four enzymes.

After electrophoresis, the blotting paper is sectioned at 2 mm intervals. Each
section contains the DNA fragments that eluted from the bottom of the gel during a
fixed time interval (1.2 min). Each section is washed in TE buffer (10 mM Tris pH
8.0, 1 mM EDTA), transferred to 50 ml elution buffer (2.5 M NaCl, 0.05 M arginine)
and heated to 70°C for 1 hour. The eluted DNA is dialyzed against water with a
Spectra/Por Microdialyzer (Fisher Scientific).
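The 1.2 minute interval follows from the paper speed and the section width, as the short calculation below shows. The section count treats the full 6 hour run as collected paper, which is an assumption made only to give a sense of scale.

```python
paper_speed_mm_per_min = 100 / 60     # 10 cm/hr
section_mm = 2                        # width of each paper section
run_hours = 6                         # time for the largest fragments to elute

minutes_per_section = section_mm / paper_speed_mm_per_min
# Assumption: paper is collected for the whole run.
sections_in_run = round(run_hours * 60 / minutes_per_section)

print(f"{minutes_per_section:.1f} min per section, "
      f"about {sections_in_run} sections")
```

Each fraction therefore samples a narrow window of elution time, and a full run yields on the order of 300 fractions per lane.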

Each sample (fraction) is PCR amplified in a 100 µl reaction with a mixture of
five primers: TagL, JOE-Adap1, 5-FAM-Adap2, TAMRA-Adap3 and ROX-Adap4,
where JOE, 5-FAM, TAMRA and ROX (PE Biosystems) are fluorescent labels
attached to the 5' ends of Adap1, etc. Amplification parameters are the same as
Example 2.

Each labeled amplicon is separately hybridized to the array of oligonucleotides
described in Example 2. Array preparation and hybridization conditions are identical
to those described above. The arrays are scanned with the ChipReader (Virtek) and
signals from the four different fluorophores are digitized and analyzed. Variability in
array-to-array hybridization signals is corrected by reference to the standards. The
order and size of the tagged fragments (i.e. the restriction maps) are reconstructed
from the hybridization patterns with reference to the standards.
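The reconstruction step can be sketched as follows. For one tag, every size fraction in which a given fluorophore gives a signal is taken as the distance from the tagged end of the clone to one site for the corresponding enzyme; sorting those distances orders the sites along the clone. The data layout, the sizes and the function names are hypothetical, and converting fraction number to fragment size by reference to the standards is assumed to have been done already.

```python
def restriction_map(hits):
    """Order restriction sites along one clone.

    hits maps an enzyme name to the fragment sizes (kb) at which the
    clone's tag was detected in that enzyme's channel. Each size is read
    as the distance from the tagged end to one site for that enzyme.
    """
    return sorted(
        (size, enzyme) for enzyme, sizes in hits.items() for size in sizes
    )


def intervals(sites):
    """Distances between consecutive sites, starting at the tagged end."""
    prev = 0.0
    out = []
    for pos, enzyme in sites:
        out.append((enzyme, round(pos - prev, 3)))
        prev = pos
    return out


# Invented partial-digest sizes for one tagged clone.
hits = {
    "EcoR I": [5.0, 22.5],
    "Xba I": [12.0],
    "Nsi I": [],
    "Bgl II": [30.0],
}
sites = restriction_map(hits)
print(sites)
print(intervals(sites))
```

Because every fragment detected for a tag shares the same tagged end, the four enzymes' sites merge into a single ordered map without any need to compare fragments between clones.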
7.4 Example 4 

This example describes a method for simultaneously positioning about 17,600 
insertion elements. The insertion elements are essentially randomly inserted into the 
genome of Escherichia coli with the use of a transposon vector. 

pNK2859 is a plasmid that carries a mini-Tn10 and a mutant transposase
between two EcoR I restriction sites. The mini-Tn10 consists of two 70 bp inverted
repeats flanking a BamH I fragment that carries the kanR gene (kanamycin resistance)
from Tn903 (Kleckner et al., 1991). The mutant transposase eliminates the insertion
site bias of the native protein.

The plasmid pIS1 is made by inserting the following sequence at the BamH I
site upstream of the kanR gene in pNK2859:

Inverted repeat GGATCCGCGGCCGCACGTGA
                     Not I
CTAGCATGGCCCGGGCGATCC (SEQ ID NO: 41) ... kanR ...
        Srf I

pIS1 is cut with EcoR I. The fragment comprising the mini-Tn10 and transposase is
ligated into the single EcoR I site in the lambda "suicide" vector P am 80X (Kleckner et
al., 1991) to make P am 80XIS1.

A pool of about 100,000 sample-tagged insertion element vectors is
constructed by PCR amplifying the sample tags from the pool of phage vectors in
Example 2 and cloning the collection between the Not I and Srf I sites in pIS1. DNA
from the phage pool is amplified with two primers (TagR and
GTCAGCGGCCGCATCCATCGAGACGGTCCA, SEQ ID NO: 42) in a PCR
reaction using Pfu polymerase (Stratagene) according to the manufacturer's
instructions. The resulting amplicons comprise TagR, the variable sequences and
TagL plus a Not I site. The amplicons are cut with Not I and P am 80XIS1 is cut with
Not I and Srf I. The sample tags and vector are ligated together and packaged in vitro
with the Gigapack III Gold Packaging Extract (Stratagene). The packaged vectors are
plated on E. coli strain C600. About 10 million phage are pooled and amplified on
C600. The sample-tagged mini-Tn10 elements are inserted into the chromosome of
strain MG1655 (Blattner et al., 1997) according to the method described by Kleckner
et al. (1991). Briefly, cells are infected with an equal number of phage, washed, grown
for 1 hour in LB and plated on LB plates plus 2.5 mM sodium pyrophosphate and 30
µg/ml kanamycin. The plates are incubated overnight at 37°C. Each colony usually
contains a sample-tagged mini-Tn10 inserted into the chromosome at a single,
essentially random site.

21,952 individual colonies are picked into separate wells of 28 grid plates.
Each grid plate contains 784 wells in a 28x28 square grid pattern, and each well holds
about 50 µl of liquid culture. The colonies are pooled in a simple 3-dimensional
pattern. A 784-pin tool is used to transfer a few microliters from each well in a plate.
The first 28 pools (i.e. the z-dimension) are made by pooling cells from all the wells
in a single plate. The x and y dimensions are made by using a pad cut with 28
"troughs". Each trough runs the length of a grid plate and is filled with LB. When the
784-pin tool is placed on the pad, 28 pins reside in each trough. Using the 784-pin
tool, a few microliters from each well of the 28 plates are transferred to the 28
troughs, representing the x-dimension (i.e. the columns). A second trough pad is
oriented so the troughs are perpendicular to the first pad's orientation. Without
changing the orientation of the plates, all the wells are transferred a second time to
make the y-dimension (i.e. the rows). The result is 28+28+28 = 84 pools, and each
well is present in exactly 3 pools.
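
The 3-dimensional pooling pattern can be sketched as follows. This is only an illustration under stated assumptions: wells are indexed from 0 to 27 in each dimension, a pool is represented as a (dimension, index) pair, and the function names are hypothetical.

```python
from itertools import product

GRID = 28  # 28 plates, each with 28 columns and 28 rows


def pools_for_well(plate, column, row):
    """Return the three pools (z, x, y) that receive cells from one well."""
    return {("z", plate), ("x", column), ("y", row)}


def decode_address(positive_pools):
    """Return (plate, column, row) if exactly one pool per dimension is positive."""
    hits = {dim: [i for d, i in positive_pools if d == dim] for dim in "zxy"}
    if all(len(found) == 1 for found in hits.values()):
        return hits["z"][0], hits["x"][0], hits["y"][0]
    return None  # tag absent, or present in more than one clone


wells = list(product(range(GRID), repeat=3))
all_pools = set().union(*(pools_for_well(*w) for w in wells))
```

A tag that hybridizes from exactly one z pool, one x pool and one y pool has an unambiguous well address. A tag present in two clones gives more than one positive pool in at least one dimension and is reported here as unresolved.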

DNA is prepared from overnight cultures of each pool. The sample tags are
amplified from each pool by PCR with primers TagL and TagR, and hybridized to
arrays of 100,000 tag complements as described in Example 2. The address of the
cells containing each sample-tagged insertion element is determined from the
hybridization patterns. About 80% (17,600) of the clones will contain sample tags
with unique addresses; that is, the sample tags are present in only one cell clone.
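
The figure of about 80% can be checked with a short calculation. The sketch assumes that each of the 21,952 clones carries one tag drawn independently and uniformly from the pool of about 100,000 tags; this sampling model and the variable names are assumptions made for illustration only.

```python
TAG_POOL = 100_000  # approximate number of distinct sample tags
CLONES = 21_952     # colonies picked into the grid plates

# Probability that none of the other clones carries the same tag as a given clone.
p_unique = (1 - 1 / TAG_POOL) ** (CLONES - 1)

# Expected number of clones whose tag occurs in no other clone.
expected_unique = CLONES * p_unique

print(f"fraction unique: {p_unique:.3f}")
print(f"expected unique clones: {expected_unique:.0f}")
```

Under these assumptions the fraction is close to 0.80, in line with the estimate of about 17,600 clones given above.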

To determine the chromosomal locations of the insertion elements, the sample-
tagged junctions first are rescued from the chromosomal DNA. A single pool is made
from the 21,952 separate bacterial clones. DNA is isolated from an overnight culture
and the junctions are rescued by "Panhandle PCR" as described in detail by Jones
(1995). Five primers are used in the method as shown below:

AATTGGAATCAATAAAGCCCTGCG    Adprimer    SEQ ID NO: 43
ACGACTGTGCTGGTCATTAAAC      Primer1     SEQ ID NO: 44
TGATGAATGTTCCGTTGCG         Primer2     SEQ ID NO: 45
CGTATTCAGGCTGACCCTG         Primer3     SEQ ID NO: 46
CGCTGCCCGGATTACA            Primer4     SEQ ID NO: 47

The 5 primers hybridize to the mini-Tn10 at sequences upstream of the sample tags
and inverted repeat. The DNA from the single pool is cut to completion with Tsp509 I
and then treated with alkaline phosphatase. Adprimer is phosphorylated in vitro with
T4 kinase and then ligated to the cut pool DNA. The ligation mixture is denatured and
then extended with Taq polymerase under conditions which allow the ligated
Adprimer to "loop back" and prime DNA synthesis into the mini-Tn10 element. The
resulting products are subjected to "nested PCR": the first amplification is with
Primer1 and Primer2, and the second is with Primer3 and Primer4. The end result is a pool of
sample-tagged junctions with sequence elements in the following order: Primer4,
sample tag, inverted repeat, junction, Adprimer and cPrimer3, where cPrimer3 is the
complement of Primer3.

Excess primers and salts are removed from the pool of sample-tagged
junctions by electrodialysis with the electroelutor (Amika Corp., Columbia, MD).

A set of ten sequencing standards is made by PCR amplifying the pool of
standards described in Example 2 with two primers, M13uni
(TGTAAAACGACGGCCAGTG, SEQ ID NO:48) and M13rev
(CAGGAAACAGCTATGACCATGA, SEQ ID NO:49). The pool of standards is
combined with the pool of sample-tagged junctions (1:2200 mass ratio).

The pooled PCR products are sequenced with TagR using the T7 Sequenase
PCR Product Sequencing Kit (Amersham Pharmacia Biotech) according to the
manufacturer's instructions with the exception that unlabeled dATP is substituted for
32P-dATP. The reaction products are processed in parallel and the sequences of the
sample-tagged junctions are determined as described in Example 2. Because the sequence
of each sample-tagged junction is now known, the location of each insertion
element can be found by comparison to the complete sequence of E. coli (Blattner et al.,
1997). About 12 bases of sequence are required to uniquely identify the location of an
insertion element in E. coli. Greater than 95% of the insertion elements will be
situated 12 base pairs or more from a Tsp509 I site, so about 16,700 of the rescued
sample-tagged junctions will contain enough genomic sequence to pinpoint their
locations. The cells containing any sample-tagged insertion element can be easily
recovered by reference to the well address of the sample tag.
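
The two estimates in the preceding paragraph can be reproduced with a rough calculation. The sketch treats the genome as a random sequence with equal base frequencies, takes the E. coli genome as approximately 4.64 million base pairs, and uses the four-base recognition sequence of Tsp509 I (AATT). These modelling choices and the variable names are assumptions for illustration, and the result is a back-of-the-envelope figure rather than a measured one.

```python
GENOME_BP = 4_639_221      # approximate size of the E. coli genome (assumed value)
UNIQUE_TAGGED = 17_600     # clones with unique sample tags (from the text)
READ_BP = 12               # bases of genomic sequence needed (from the text)
SITE_BP = 4                # Tsp509 I recognizes the four-base sequence AATT

# Smallest read length k for which the number of possible k-mers exceeds the
# number of positions on both strands of the genome (a rough lower bound).
positions = 2 * GENOME_BP
k = 1
while 4 ** k <= positions:
    k += 1
min_read_bp = k

# Chance that no Tsp509 I site starts within the first READ_BP bases of the
# junction, with a site expected once every 4**SITE_BP = 256 positions.
p_site = 0.25 ** SITE_BP
frac_usable = (1 - p_site) ** READ_BP
usable_junctions = UNIQUE_TAGGED * frac_usable

print(min_read_bp, round(frac_usable, 3), round(usable_junctions))
```

Under these assumptions the minimum read length is 12 bases and the usable fraction is slightly above 95%, which is consistent with the figures stated above.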

The present invention is not to be limited in scope by the exemplified
embodiments, which are intended as illustrations of single aspects of the invention,
and methods which are functionally equivalent are within the scope of the invention.
Indeed, various modifications of the invention in addition to those described herein
will become apparent to those skilled in the art from the foregoing description and the
accompanying Figures and Drawings. Such modifications are intended to fall within
the scope of the appended claims.
CLAIMS

I claim: 

1. A parallel method for polynucleotide sequencing, comprising:
a) preparing a library comprising a collection of two or more sample-
tagged polynucleotide clones;
b) carrying out a nucleic acid sequencing reaction on the library wherein a
first sequencing primer binding site or no sequencing primer binding
site is used to generate a plurality of tagged reaction products from the
sample-tagged clones in the collection;
c) separating the reaction products according to size;
d) collecting fractions of the separated reaction products;
e) amplifying the products collected in step (d) to generate tagged
amplicons;
f) hybridizing the tagged amplicons to an array comprising tag
complements; and
g) determining a plurality of polynucleotide sequences of the sample-
tagged clones by detecting the hybridizations to deconvolute a plurality
of sequence ladders for the sample-tagged clones in the collection.



2. The method of claim 1, wherein the library further comprises a second
collection of sample-tagged clones, and a second plurality of tagged sequencing
reaction products is generated from each clone in the second collection using a second
sequencing primer binding site.

3. The method of claim 1, wherein the collection comprises a pool of the sample-
tagged clones.

4. The method of claim 3, wherein the library preparation step (a) comprises
pooling a first pool comprising sample-tagged polynucleotides with a second pool
comprising sample polynucleotides, carrying out a reaction to join polynucleotides
from the first pool with polynucleotides from the second pool, and cloning the
reaction products.

5. The method of claim 4, wherein the sample-tagged polynucleotides in the first
pool are cloning vectors.

6. The method of claim 1, wherein after the hybridization step (f) the tagged
amplicons are amplified in situ.

7. The method of claim 1, wherein the sequencing reaction does not comprise the
use of sequencing primers.

8. The method of claim 7, wherein the sequencing reaction comprises chemical
cleavage of the sample-tagged clones.

9. The method of claim 1, wherein the sequencing reaction comprises a plurality
of reactions that are carried out on separate aliquots of the library to generate one pool
of the tagged reaction products from each aliquot.

10. The method of claim 9, wherein the pools of the tagged reaction products are
independently separated in the separation step (c).

11. The method of claim 9, wherein at least two of the pools of the tagged reaction
products are pooled before the separation step (c).

12. The method of claim 11, wherein the tagged reaction products comprise tags
that identify the pools.

13. The method of claim 1, wherein the tagged reaction products are pooled before
the separation step.

14. The method of claim 1, wherein the sample-tagged clones comprise genomic
tags.

15. The method of claim 1, wherein the sample-tagged clones comprise adapter
tags.

16. The method of claim 15, wherein each sample-tagged clone comprises a
distinct sequence element, two common sequence elements and a sample sequence
element such that the order of elements in the polynucleotide is the first common
element, the distinct element, the second common element and the sample sequence
element.

17. The method of claim 16, wherein the tagged amplicons comprise the distinct
sequence elements.

18. The method of claim 17, wherein the tag complements hybridize to the distinct
sequence elements.

19. The method of claim 18, wherein the tag complements and the distinct
sequence elements form perfectly matched duplexes.

20. The method of claim 16, wherein the first common sequence element
comprises a sequencing primer binding site.

21. The method of claim 16, wherein the second common sequence element
comprises a primer binding site for use in the amplification step (e).

22. The method of claim 1, wherein the separation step (c) comprises gel
electrophoresis.

23. The method of claim 1, wherein the amplification step (e) comprises a
polymerase chain reaction.

24. The method of claim 1, wherein the amplification step comprises the use of an
RNA polymerase.

25. The method of claim 1, wherein the collection comprises greater than 100
sample-tagged polynucleotide clones.

26. The method of claim 25, wherein the collection comprises greater than 1000
sample-tagged polynucleotide clones.

27. The method of claim 26, wherein the collection comprises greater than
100,000 sample-tagged polynucleotide clones.

28. A method for constructing a recombinant molecule, comprising: sequencing a
polynucleotide according to the method of claim 1, identifying a homolog of the
polynucleotide, and joining a sequence element from the homolog to a vector.

29. A method for producing a polypeptide, comprising: constructing a
recombinant molecule according to the method of claim 28 wherein the vector is an
expression vector, and transferring the recombinant molecule to a host.

30. A method for producing a polypeptide of known sequence, comprising:
sequencing a polynucleotide according to the method of claim 1, wherein the sample-
tagged clones are contained in an expression vector, and transferring the
polynucleotide to a host.

31. A method for constructing a database of genetic information, comprising:
sequencing polynucleotides according to the method of claim 1 to generate genetic
information, and storing the genetic information in a database.

32. A method for identifying polymorphisms, comprising: sequencing
polynucleotides according to the method of claim 1, and comparing the sequences to a
database of sequences to identify the polymorphisms.

33. A method for identifying a gene associated with a phenotype, comprising:
identifying polymorphisms according to the method of claim 32, assaying a
distribution of the polymorphisms in a population to identify a subset of the
polymorphisms that associate with the phenotype, and identifying the gene linked to
the subset of polymorphisms.

34. A method for identifying a compound that interacts with a gene product,
comprising: identifying a gene according to the method of claim 33, identifying a
homolog of the gene, preparing the gene product encoded in the homolog, providing
compounds, assaying the interaction between the compounds and the gene product,
and identifying the compound that interacts with the gene product.

35. A method for identifying a compound that interacts with a polypeptide,
comprising: sequencing a polynucleotide according to the method of claim 1,
identifying a homolog of the polynucleotide, producing the polypeptide encoded in the
homolog, providing compounds, assaying the interaction between the compounds and
the polypeptide, and identifying the compound that interacts with the polypeptide.

36. A method for identifying a compound that modulates expression of a gene,
comprising: sequencing a polynucleotide according to the method of claim 1,
identifying a homolog of the polynucleotide, providing compounds, assaying the
ability of the compounds to affect the expression of the homolog, and identifying the
compound that modulates the expression.

37. A method for preparing antibodies to a polypeptide, comprising: sequencing a
polynucleotide according to the method of claim 1, identifying a homolog of the
polynucleotide, and preparing antibodies to the polypeptide encoded by the homolog.

38. A method for preparing an array of oligonucleotides for assaying
polynucleotides of known sequence, comprising: sequencing polynucleotides
according to the method of claim 1, and preparing the array of oligonucleotides that
hybridize to the polynucleotides.

39. A method for identifying a gene associated with a phenotype, comprising:
locating the gene to within a defined genomic region, and sequencing the region
according to the method of claim 1.

40. A method for producing a transgenic cell, comprising: sequencing a
polynucleotide according to the method of claim 1, identifying a homolog of the
polynucleotide, and creating a transgenic cell that expresses the homolog.

41. A method for producing a cell with a mutated gene, comprising: sequencing a
polynucleotide according to the method of claim 1, identifying a homolog of the
polynucleotide, identifying a gene corresponding to the homolog, designing a
mutation in the gene, and introducing the mutation into the gene in the cell.

42. A method for assaying gene expression, comprising: sequencing
polynucleotides according to the method of claim 1, preparing probes to the
polynucleotides, and using the probes to identify expression levels of the
polynucleotides.

43. A method for identifying a bioactive polypeptide, comprising: sequencing a
polynucleotide according to the method of claim 1, identifying a homolog of the
polynucleotide, producing a polypeptide encoded in the homolog, and assaying the
polypeptide for biological activity to identify the bioactive polypeptide.

44. A parallel method for polynucleotide sequencing, comprising:
a) preparing a library comprising sample-tagged polynucleotide clones;
b) carrying out a nucleic acid sequencing reaction on the library to
generate a plurality of tagged reaction products;
c) separating the reaction products according to size;
d) collecting fractions of the separated reaction products;
e) hybridizing the products collected in step (d) to an array comprising tag
complements;
f) amplifying the hybridized reaction products in situ; and
g) determining a plurality of polynucleotide sequences of the sample-
tagged clones by detecting the amplified reaction products to
deconvolute a plurality of sequence ladders for the sample-tagged
clones.



45. The method of claim 44, wherein the amplification step (f) comprises the use
of an RNA polymerase.

46. The method of claim 44, wherein the amplification step (f) comprises a
rolling-circle type amplification.

47. A method for constructing a recombinant molecule, comprising: sequencing a
polynucleotide according to the method of claim 44, identifying a homolog of the
polynucleotide, and joining a sequence element from the homolog to a vector.

48. A method for producing a polypeptide, comprising: constructing a 
recombinant molecule according to the method of claim 47 wherein the vector is an 
expression vector, and transferring the recombinant molecule to a host. 

49. A method for producing a polypeptide of known sequence, comprising: 
sequencing a polynucleotide according to the method of claim 44, wherein the 
sample-tagged clones are contained in an expression vector, and transferring the 
polynucleotide to a host. 

50. A method for constructing a database of genetic information, comprising: 
sequencing polynucleotides according to the method of claim 44 to generate genetic 
information, and storing the genetic information in a database. 

51. A method for identifying polymorphisms, comprising: sequencing 
polynucleotides according to the method of claim 44, and comparing the sequences to 
a database of sequences to identify the polymorphisms. 

52. A method for identifying a gene associated with a phenotype, comprising: 
identifying polymorphisms according to the method of claim 51, assaying a 
distribution of the polymorphisms in a population to discover a subset of the 
polymorphisms that associate with the phenotype, and identifying the gene linked to 
the subset of polymorphisms. 

53. A method for identifying a compound that interacts with a gene product, 
comprising: identifying a gene according to the method of claim 52, identifying a 
homolog of the gene, preparing the gene product encoded in the homolog, providing 
compounds, assaying the interaction between the compounds and the gene product, 
and identifying the compound that interacts with the gene product. 

54. A method for identifying a compound that interacts with a polypeptide, 
comprising: sequencing a polynucleotide according to the method of claim 44, 
identifying a homolog of the polynucleotide, identifying the polypeptide encoded in 
the polynucleotide, providing compounds, assaying the interaction between the 
compounds and the polypeptide, and identifying the compound that interacts with the 
polypeptide. 

55. A method for identifying a compound that modulates expression of a gene,
comprising: sequencing a polynucleotide according to the method of claim 44,
identifying a homolog of the polynucleotide, providing compounds, assaying the
ability of the compounds to affect the expression of the homolog, and identifying the
compound that modulates the expression.

56. A method for preparing antibodies to a polypeptide, comprising: sequencing a
polynucleotide according to the method of claim 44, identifying a homolog of the
polynucleotide, and preparing antibodies to the polypeptide encoded by the homolog.

57. A method for preparing an array of oligonucleotides for assaying
polynucleotides of known sequence, comprising: sequencing polynucleotides
according to the method of claim 44, and preparing the array of oligonucleotides that
hybridize to the polynucleotides.

58. A method for identifying a gene associated with a phenotype, comprising:
locating the gene to within a defined genomic region, and sequencing the region
according to the method of claim 44.

59. A method for producing a transgenic cell, comprising: sequencing a
polynucleotide according to the method of claim 44, identifying a homolog of the
polynucleotide, and creating a transgenic cell that expresses the homolog.

60. A method for producing a cell with a mutated gene, comprising: sequencing a
polynucleotide according to the method of claim 44, identifying a homolog of the
polynucleotide, identifying a gene corresponding to the homolog, designing a
mutation in the gene, and introducing the mutation into the gene in the cell.

61. A method for assaying gene expression, comprising: sequencing
polynucleotides according to the method of claim 44, preparing probes to the
polynucleotides, and using the probes to identify expression levels of the
polynucleotides.

62. A method for identifying a bioactive polypeptide, comprising: sequencing a
polynucleotide according to the method of claim 1, identifying a homolog of the
polynucleotide, producing a polypeptide encoded in the homolog, and assaying the
polypeptide for biological activity to identify the bioactive polypeptide.

63. A parallel method for physical mapping, comprising:
a) preparing a library comprising a collection of sample-tagged
polynucleotide clones wherein each clone comprises a plurality of
landmarks;
b) carrying out a cleavage reaction on the library to cleave the sample-
tagged clones at the landmarks to generate from each clone a plurality
of tagged cleavage products of different sizes;
c) separating the cleavage products according to size;
d) collecting fractions of the separated cleavage products;
e) amplifying the products collected in step (d) to generate tagged
amplicons;
f) hybridizing the tagged amplicons to an array comprising tag
complements; and
g) determining a plurality of physical maps by detecting the
hybridizations to deconvolute the locations of the landmarks in the
sample-tagged clones.

64. The method of claim 63, wherein the landmarks are restriction sites and the
cleavage reaction comprises a partial digestion with a restriction enzyme.

65. The method of claim 63, wherein the sample-tagged polynucleotide clones
comprise subclones of a large polynucleotide and after step (g), constructing a
physical map of the large polynucleotide from overlapping physical maps of the
subclones.

66. The method of claim 63, wherein step (b) comprises performing a plurality of
cleavage reactions on the library wherein each cleavage reaction cleaves the sample-
tagged clones at different landmarks to generate one pool of the tagged cleavage
products for each cleavage reaction.

67. The method of claim 66, wherein the pools of the tagged cleavage products are
independently separated in the separation step (c).

68. The method of claim 66, wherein at least two of the pools of the tagged
cleavage products are pooled before the separation step (c).

69. The method of claim 68, wherein the tagged cleavage products comprise tags
that identify the pools.

70. The method of claim 63, wherein the library comprises a pool of the sample-
tagged clones.

71. The method of claim 70, wherein preparing the library comprises pooling a
first pool comprising sample-tagged polynucleotides with a second pool comprising
sample polynucleotides, carrying out a reaction to join polynucleotides from the first
pool with polynucleotides from the second pool, and cloning the reaction products.

72. The method of claim 63, wherein the sample-tagged clones comprise genomic
tags.

73. The method of claim 63, wherein the sample-tagged clones comprise adapter
tags.

74. The method of claim 73, wherein each sample-tagged clone comprises a
distinct sequence element, two common sequence elements and a sample sequence
element such that the order of elements in the polynucleotide is the first common
element, the distinct element, the second common element and the sample sequence
element.

75. The method of claim 74, wherein the tagged amplicons comprise the distinct
sequence elements.

76. The method of claim 75, wherein the tag complements hybridize to the distinct
sequence elements.

77. The method of claim 76, wherein the tag complements and the distinct
sequence elements form perfectly matched duplexes.

78. The method of claim 63, wherein after the hybridization step (f) the amplicons
are amplified in situ.

79. The method of claim 63, wherein the amplification step (e) comprises a
polymerase chain reaction.

80. A method for identifying polymorphisms, comprising: generating physical
maps of polynucleotides according to the method of claim 63, comparing the physical
maps to a database of physical maps, and identifying differences in the landmarks to
identify the polymorphisms.

81. A method for identifying a gene associated with a phenotype, comprising:
identifying polymorphisms according to the method of claim 80, assaying the
distribution of the polymorphisms in a population, and identifying a subset of the
polymorphisms that associate with the phenotype to identify the gene.

82. A method for identifying a compound that interacts with a gene product,
comprising: identifying a gene according to the method of claim 81, identifying a
homolog of the gene, preparing the gene product encoded in the homolog, providing
compounds, assaying the interaction between the compounds and the gene product,
and identifying the compound that interacts with the gene product.

83. A method for locating genomic rearrangements, comprising: generating
physical maps of polynucleotides according to the method of claim 63, comparing the
physical maps to a database of physical maps, and identifying differences in the
landmarks to locate the genomic rearrangements.

84. The method of claim 83, wherein the polynucleotides are isolated from a
diseased tissue.

85. A method for identifying a gene associated with a disease, comprising:
locating a genomic rearrangement according to the method of claim 84, and
identifying the gene near the rearrangement.

86. A method for identifying a compound that interacts with a gene product,
comprising: identifying a gene according to the method of claim 85, identifying a
homolog of the gene, preparing the gene product encoded in the homolog, providing
compounds, assaying the interaction between the compounds and the gene product,
and identifying the compound that interacts with the gene product.

87. A parallel method for physical mapping, comprising:
a) preparing a library comprising a collection of sample-tagged
polynucleotide clones wherein each clone comprises a plurality of
landmarks;
b) carrying out a cleavage reaction on the library to cleave the sample-
tagged clones at the landmarks to generate from each clone a plurality
of tagged cleavage products of different sizes;
c) separating the cleavage products according to size;
d) collecting fractions of the separated cleavage products;
e) hybridizing the products collected in step (d) to an array comprising tag
complements; and
f) determining a plurality of physical maps by detecting the
hybridizations to deconvolute the locations of the landmarks in the
sample-tagged clones.

88. The method of claim 87, wherein the hybridized cleavage products are
amplified in situ before detection.

89. The method of claim 88, wherein the in situ amplification comprises the use of
an RNA polymerase.

90. A method for identifying polymorphisms, comprising: generating physical
maps of polynucleotides according to the method of claim 87, comparing the physical
maps to a database of physical maps, and identifying differences in the landmarks to
identify the polymorphisms.

91. A method for identifying a gene associated with a phenotype, comprising:
identifying polymorphisms according to the method of claim 90, assaying the
distribution of the polymorphisms in a population, and identifying a subset of the
polymorphisms that associate with the phenotype to identify the gene.

92. A method for identifying a compound that interacts with a gene product,
comprising: identifying a gene according to the method of claim 91, identifying a
homolog of the gene, preparing the gene product encoded in the homolog, providing
compounds, assaying the interaction between the compounds and the gene product,
and identifying the compound that interacts with the gene product.

93. A method for locating genomic rearrangements, comprising: generating
physical maps of polynucleotides according to the method of claim 87, comparing the
physical maps to a database of physical maps, and identifying differences in the
landmarks to locate the genomic rearrangements.

94. The method of claim 93, wherein the polynucleotides are isolated from a
diseased tissue.

95. A method for identifying a gene associated with a disease, comprising:
locating a genomic rearrangement according to the method of claim 94, and
identifying the gene near the rearrangement.

96. A method for identifying a compound that interacts with a gene product,
comprising: identifying a gene according to the method of claim 95, identifying a
homolog of the gene, preparing the gene product encoded in the homolog, providing
compounds, assaying the interaction between the compounds and the gene product,
and identifying the compound that interacts with the gene product.

1 97. A parallel method for producing cells containing located insertion elements, 
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2 comprising; 

3 a) producing cells comprising insertion elements integrated into a 

4 plurality of locations; 

5 b) preparing a library of polynucleotide clones comprising sample-tagged 

6 junctions from the insertion elements; 

7 c) carrying out a mapping reaction on the library to generate tagged 

8 reaction products; and 

9 d) identifying the locations of the insertion elements by associating the 

10 reaction products with an array of tag complements to deconvolute 

1 1 maps of the junctions. 

1 98. The method of claim 97, wherein the library comprises a pool of the 

2 polynucleotide clones. 

1 99. The method of claim 97, wherein the cells comprise the sample- tagged 

2 junctions. 

1 100. The method of claim 99, wherein the insertion elements comprise sample tags. 

1 101. The method of claim 99, wherein the sample-tagged junctions comprise 

2 genomic tags. 

1 1 02. The method of claim 97, wherein the sample-tagged junctions comprise 

2 adapter tags. 

1 103. The method of claim 97, further comprising maintaining separate collections

2 of cells wherein each collection comprises a subset of the locations of insertion 

3 elements, and identifying the subset in each collection. 

1 104. The method of claim 103, wherein the separate collections comprise separate 

2 clonal populations of cells. 

1 105. The method of claim 103, wherein the cells comprise the sample-tagged

2 junctions and the step of identifying the subset comprises pooling the collections to 

3 generate a plurality of subpools, amplifying polynucleotides from each subpool to 

4 generate tagged amplicons, and hybridizing the amplicons to an array of the tag 

5 complements. 

1 106. The method of claim 105, wherein each collection is present in a unique

2 combination of subpools. 
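
The pooling scheme of claims 105 and 106 can be illustrated with a short Python sketch. It assumes that each sample tag occurs in only one collection and that hybridization gives a clean present/absent call for each tag in each subpool; the function names and the data layout are hypothetical and are not taken from the specification.

```python
from itertools import combinations


def assign_subpools(n_collections, n_subpools):
    """Give every collection a distinct, non-empty combination of subpools."""
    codes = []
    for size in range(1, n_subpools + 1):
        for combo in combinations(range(n_subpools), size):
            codes.append(frozenset(combo))
            if len(codes) == n_collections:
                return codes
    raise ValueError("too few subpools for unique combinations")


def build_subpools(collection_tags, codes, n_subpools):
    """Pool the tags of each collection into the subpools named by its code."""
    subpools = [set() for _ in range(n_subpools)]
    for tags, code in zip(collection_tags, codes):
        for index in code:
            subpools[index] |= set(tags)
    return subpools


def decode(subpools, codes):
    """Map each detected tag to the collection whose code matches its pattern."""
    collection_by_code = {code: i for i, code in enumerate(codes)}
    decoded = {}
    for tag in set().union(*subpools):
        pattern = frozenset(i for i, pool in enumerate(subpools) if tag in pool)
        decoded[tag] = collection_by_code.get(pattern)  # None if ambiguous
    return decoded
```

With n subpools, up to 2^n - 1 collections can receive distinct combinations, so the number of amplification and hybridization experiments grows only logarithmically with the number of collections.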

1 107. The method of claim 103, wherein the locations comprise sequences from the

2 junctions, and the step of identifying the subset comprises pooling the collections to

3 generate a plurality of subpools; amplifying polynucleotides from each subpool to 

4 generate amplicons comprising the junctions; and hybridizing the amplicons to an 

5 array comprising polynucleotides that are complementary to the junction sequences. 

1 108. The method of claim 107, wherein each collection is present in a unique 

2 combination of subpools. 

1 109. The method of claim 97, further comprising before step (c) hybridizing the 

2 clones to the tag complements to generate an array comprising tagged clones. 

1 110. The method of claim 109, wherein the mapping reaction is an array sequencing 

2 reaction. 

1 111. The method of claim 97, wherein the step of associating (d) comprises

2 hybridizing the tagged reaction products to tag complements.

1 112. The method of claim 111, further comprising separating the tagged reaction 

2 products by size and collecting fractions of the separated products before the 

3 hybridization step. 

1 113. The method of claim 97, further comprising amplifying the tagged reaction 

2 products to generate tagged amplicons; and wherein the step of associating (d) 

3 comprises hybridizing the amplicons to tag complements. 

1 114. The method of claim 113, further comprising separating the tagged reaction 

2 products by size and collecting fractions of the separated products before the 

3 amplification step. 

1 115. The method of claim 97, wherein the mapping reaction comprises a 

2 sequencing method. 

1 116. The method of claim 97, wherein the sequencing reaction comprises a nucleic 

2 acid sequencing reaction. 

1 117. A method for producing a transgenic organism, comprising: producing cells 

2 according to the method of claim 97, producing daughter cells, and cloning the 

3 organism from the cells. 

1 118. A method for identifying a bioactive compound, comprising: producing a 

2 transgenic organism according to the method of claim 117, providing compounds,

3 assaying the compounds on the organism, and identifying the bioactive compound. 

1 119. A method for identifying a bioactive compound, comprising: producing cells

2 according to the method of claim 97, producing daughter cells, providing compounds,

3 assaying the compounds on the cells, and identifying the bioactive compound.

SEQUENCE LISTING

<110> Strathmann, Michael P

<120> Parallel Methods for Genomic Analysis 

<130> 20946-701PCT

<140> Not yet assigned 
<141> 

<150> US 60/105,914 
<151> 1998-10-28 

<150> To be assigned
<151> 1999-10-26 

<160> 49 

<170> PatentIn Ver. 2.0

<210> 1 
<211> 33
<212> DNA

<213> Artificial Sequence
<220> 

<223> Description of Artificial Sequence:tag 
<400> 1 

cagcaccagg aaggtggcca ggttggcagt gta 33 

<210> 2
<211> 33
<212> DNA

<213> Artificial Sequence
<220> 

<223> Description of Artificial Sequence:tag 
<400> 2 

cctagctctc ngaagtcat cggccagggt gga 33 

<210>3 
<211> 26 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence:tag 
<400> 3 

atcaagctta tggatcccgt cgacct 26 

<210> 4
<211> 33
<212> DNA

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: tag 
<400> 4 

ggtgctcgtg tctttatcgt ccctacgtct ctt 33 

<210>5 
<211>28 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence:tag
<400> 5 

aatrttgaag nagctttga ttccattc 28 

<210> 6
<211> 33
<212> DNA

<213> Artificial Sequence
<220> 

<223> Description of Artificial Sequence:tag 
<400> 6 

GGCGTCCTGC TGCAGTCTGG CATTGGGGAA 30 

<210> 7
<211> 27
<212> DNA

<213> Artificial Sequence
<220>

<223> Description of Artificial Sequence:tag
<400> 7

attgaagatg gaggcgttca actagca 27

<210> 8
<211> 33
<212> DNA

<213> Artificial Sequence
<220>

<223> Description of Artificial Sequence:tag
<400> 8

gatgaactat acaagcttat gtccagactt cca 33

<210> 9
<211> 27
<212> DNA

<213> Artificial Sequence
<220>

<223> Description of Artificial Sequence:tag
<400> 9

aagggcagat tggtaggaca ggtaatg 27 

<210> 10 
<211> 23
<212> DNA

<213> Artificial Sequence
<220>

<223> Description of Artificial Sequence:tag
<400> 10

ccgtcgggca tccgcgcctt gag 23 

<210> 11 
<211>33 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence:tag 
<400> 11

tacattgtgt gagttgaagt tgtattccaa ttt 33 

<210> 12 
<211>95 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: sample tag
<400> 12 

atttaggtga cactatagaa ctcgaccagt acattgtgtg agttgaagtt gtattccaat 60 
ttctgaagct tgcatgcctg caggtcgact ctaga 95 

<210> 13 
<211> 13 
<212> DNA 

<213> Artificial Sequence
<220>

<223> Description of Artificial Sequence: vector
polylinker

<400> 13 

gatctgccgg tct 13 

<210> 14 
<211> 24
<212> DNA

<213> Artificial Sequence
<220> 

<223> Description of Artificial Sequence: primer 
<400> 14 

ggtgacacta tagaactcga gcag 24 

<210> 15 
<211> 18 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: primer 
<400> 15 

caggcatgca agcttcag 18 

<210> 16 
<211>20 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: primer 
<400> 16 

cactatcgac tacgcgatca 20 

<210> 17 
<211> 113
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: vector 
polylinker 

<400> 17

gaattccatg ttgttggggc gcgcctccat caacgtggat ccatcgagac ggtccagagc 60 
tcagtggcgc atgcaatgct ccaactgcag gttagccatg gttgcccaag ctt 113 

<210> 18 
<211> 12 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: adapter 
<400> 18 

tgagtcacca ac 12

<210> 19
<211> 25
<212> DNA

<213> Artificial Sequence
<220> 

<223> Description of Artificial Sequence: tag 
<400> 19 

tcaatcgact acactcgtaa caaga 25 

<210> 20 
<211> 25
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: tag 
<400> 20 

gatcaattcg ctaatcgatc gtata 25 

<210> 21
<211> 25
<212> DNA

<213> Artificial Sequence
<220> 

<223> Description of Artificial Sequence: tag 
<400> 21

aaatagatcg cataagcagt acgtg 25 

<210> 22 
<211>25 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: tag 
<400> 22 

tcataggctg acagtcctag ctagt 25 

<210>23 
<211> 25
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: tag 
<400> 23 

tcgtagacag tacatgtcga tgaat 25 

<210> 24 
<211> 25
<212> DNA

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: tag 
<400> 24 

taaccgatct agtcgatcta cgact 25 

<210>25 
<211> 25
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: tag 
<400> 25 

gtttcgagct acctaagaga ctcgt 25 

<210> 26 
<211>25 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: tag 
<400> 26 

cgtatttcga ctgactagcc tctag 25 

<210> 27 
<211>25 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: tag 
<400> 27 

agttcgatca gctaactctg agtca 25 

<210>28 
<211>25 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: tag 
<400> 28 

gctatatcga tcgtccatta acgta 25 

<210> 29 
<211> 18 
<212> DNA

<213> Artificial Sequence
<220> 

<223> Description of Artificial Sequence: primer 
<400> 29 

gggcaaccat ggctaacc 18

<210> 30 
<211>21 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: primer 
<400> 30 

ctgaatctta ccaacgctaa c 21

<210>31 
<211> 18 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: primer 
<400> 31

atccatcgag acggtcca 18 

<210>32 
<211> 19 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: primer 
<400> 32 

caacgtggat ccatcgaga 19

<210> 33 
<211> 20
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: adapter 
<400> 33 

gctcattgcg gtagcatacc 20

<210> 34
<211> 10
<212> DNA

<213> Artificial Sequence
<220>

<223> Description of Artificial Sequence: adapter
<400> 34

ggtatgctac 10

<210> 35
<211> 20
<212> DNA

<213> Artificial Sequence
<220>

<223> Description of Artificial Sequence: adapter
<400> 35

gcgtggccta ctacgattgt 20

<210> 36
<211> 10
<212> DNA

<213> Artificial Sequence
<220>

<223> Description of Artificial Sequence: adapter
<400> 36

tcaatcgtag 10

<210> 37
<211> 20
<212> DNA

<213> Artificial Sequence
<220>

<223> Description of Artificial Sequence: adapter
<400> 37

gacgtagcga actagggcag 20

<210> 38
<211> 10
<212> DNA

<213> Artificial Sequence
<220>

<223> Description of Artificial Sequence: adapter
<400> 38

ctgccctagt 10

<210> 39
<211> 20
<212> DNA

<213> Artificial Sequence
<220>

<223> Description of Artificial Sequence: adapter 
<400> 39 

gcaagcagcc tacgcattat 20 

<210>40 
<211> 10 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: adapter 
<400> 40 

ataatgcgta 10 

<210>41 
<211>41 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: polylinker 
<400> 41

ggatccgcgg ccgcacgtga ctagcatggc ccgggcgatc c 41

<210>42 
<211>30 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: primer 
<400> 42 

gtcagcggcc gcatccatcg agacggtcca 30

<210> 43 
<211>24 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: primer 
<400> 43 

aattggaatc aataaagccc tgcg 24

<210> 44
<211> 22
<212> DNA

<213> Artificial Sequence
<220>

<223> Description of Artificial Sequence: primer
<400> 44

acgactgtgc tggtcattaa ac 22 

<210> 45 
<211> 19 
<212> DNA 

<213> Artificial Sequence
<220> 

<223> Description of Artificial Sequence: primer 
<400> 45 

tgatgaatgt tccgttgcg 19 

<210>46 
<211> 19
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: primer 
<400> 46 

cgtattcagg ctgaccctg 19 

<210>47 
<211> 16 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: primer 
<400> 47 

cgctgcccgg attaca 16 

<210>48 
<211> 19 
<212> DNA 

<213> Artificial Sequence 
<220>

<223> Description of Artificial Sequence: primer 
<400> 48 

tgtaaaacga cggccagtg 19

<210> 49 
<211>22 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: primer
<400> 49

caggaaacag ctatgaccat ga 22